There is an implicit assumption within most of the responses in this thread that your talk of "IP addresses" is confined to IPv4 addresses. Is this a correct assumption?

I ask because, if it is correct, it is eminently feasible to consider using direct addressing into a 16GB file that holds a 32-bit count for each of the 4 billion (2^32) possible IPv4 addresses. This would require minimal memory, but reading, incrementing and writing 4 bytes within a 16GB file for each of 1 billion (essentially random) IPs discovered would be too costly to contemplate. And far more so if the reads and writes go to any kind of DB.
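To make the arithmetic concrete, here is a minimal sketch of the direct-addressing idea and of the naive per-IP update that is too slow at this scale. It is in Python rather than Perl purely for illustration. The names `slot_offset` and `bump`, and the little-endian counter layout, are assumptions of the sketch and not anything stated in the thread.

```python
import ipaddress
import struct

COUNT_SIZE = 4                      # bytes per 32-bit counter
FILE_SIZE = (1 << 32) * COUNT_SIZE  # 2**32 addresses * 4 bytes = 16 GiB


def slot_offset(ip: str) -> int:
    """Byte offset of the counter for a dotted-quad IPv4 address."""
    return int(ipaddress.IPv4Address(ip)) * COUNT_SIZE


def bump(fh, ip: str) -> int:
    """Naive read-increment-write of one counter: one seek per IP seen.

    `fh` is any seekable binary file object opened for reading and writing.
    Counters are stored little-endian (an assumption of this sketch).
    """
    off = slot_offset(ip)
    fh.seek(off)
    raw = fh.read(COUNT_SIZE)
    # A slot beyond the current end of file has never been written: treat as 0.
    count = struct.unpack("<I", raw)[0] if len(raw) == COUNT_SIZE else 0
    count += 1
    fh.seek(off)
    fh.write(struct.pack("<I", count))
    return count
```

Calling `bump` once per log line is exactly the one-seek-per-IP pattern: with a billion essentially random addresses it means a billion scattered 4-byte reads and writes.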

The secret to making this work efficiently would be to

  1. Memory map the 16GB file.
  2. Cache the IPs in memory as a sparse array (a hash in Perl's terms) as they are discovered, and only write them to disk when a given number of unique IPs have been counted.

    Once the cache limit is reached, you sort the unique IPs numerically and then read-update-write largish (say 64K) lumps of the file. You then reset (undef) the hash and continue.
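The cache-then-flush step can be sketched roughly as follows, again in Python rather than Perl for illustration. IPs are assumed to arrive as 32-bit integers. The function names, the default limit of a million unique IPs, the little-endian layout and the choice to saturate a counter at 2^32-1 instead of letting it wrap are all assumptions of the sketch. It uses plain seek/read/write on a file handle, so it would need adapting to sit on top of a memory-mapped file.

```python
import struct
from collections import Counter

COUNT_SIZE = 4                     # bytes per 32-bit counter
CHUNK = 64 * 1024                  # bytes per read-update-write lump
PER_CHUNK = CHUNK // COUNT_SIZE    # counters held in one lump


def _read_chunk(fh, n):
    """Read lump n, zero-padded if the file is shorter than that."""
    fh.seek(n * CHUNK)
    data = fh.read(CHUNK)
    buf = bytearray(CHUNK)
    buf[:len(data)] = data
    return buf


def _write_chunk(fh, n, buf):
    fh.seek(n * CHUNK)
    fh.write(buf)


def flush(fh, cache):
    """Apply cached counts to the file, touching each 64K lump only once."""
    chunk_no, buf = None, None
    for ip in sorted(cache):       # ascending IP means ascending file offset
        n, idx = divmod(ip, PER_CHUNK)
        if n != chunk_no:          # moved on to a new lump
            if buf is not None:
                _write_chunk(fh, chunk_no, buf)
            chunk_no, buf = n, _read_chunk(fh, n)
        off = idx * COUNT_SIZE
        (old,) = struct.unpack_from("<I", buf, off)
        # Saturate instead of overflowing the 32-bit counter (an assumption).
        struct.pack_into("<I", buf, off, min(old + cache[ip], 0xFFFFFFFF))
    if buf is not None:
        _write_chunk(fh, chunk_no, buf)
    cache.clear()                  # the equivalent of undef-ing the hash


def count_ips(ips, fh, limit=1_000_000):
    """Count integer IPs, flushing whenever `limit` unique IPs are cached."""
    cache = Counter()
    for ip in ips:
        cache[ip] += 1
        if len(cache) >= limit:
            flush(fh, cache)
    flush(fh, cache)


def read_count(fh, ip):
    """Fetch the stored count for one integer IP (0 if never written)."""
    fh.seek(ip * COUNT_SIZE)
    raw = fh.read(COUNT_SIZE)
    return struct.unpack("<I", raw)[0] if len(raw) == COUNT_SIZE else 0
```

The `limit` parameter is the knob that trades memory for disk traffic: a larger cache means fewer flushes and more IPs sharing each lump.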

By using memory mapped files, you let the virtual/physical address mapping capabilities of the OS and hardware take care of accessing the appropriate chunk of the file efficiently.
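As a rough illustration of the mapping step, the following Python sketch (not Perl) maps a file of 32-bit counters and updates slots in place. `open_counts`, `add` and `get` are hypothetical helper names, and the little-endian layout and saturating add are assumptions. Mapping the full 16GB needs a 64-bit process. The `slots` parameter exists only so the idea can be tried on a small file.

```python
import mmap
import os
import struct

COUNT_SIZE = 4   # bytes per 32-bit counter


def open_counts(path, slots=1 << 32):
    """Create the counter file if needed and memory-map all of it."""
    size = slots * COUNT_SIZE
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        if os.fstat(fd).st_size < size:
            os.ftruncate(fd, size)   # extend with zeros (sparse where supported)
        return mmap.mmap(fd, size, access=mmap.ACCESS_WRITE)
    finally:
        os.close(fd)                 # the mapping keeps its own handle


def add(mm, ip, n=1):
    """Add n to the counter for integer IP `ip`, saturating at 2**32 - 1."""
    off = ip * COUNT_SIZE
    (old,) = struct.unpack_from("<I", mm, off)
    struct.pack_into("<I", mm, off, min(old + n, 0xFFFFFFFF))


def get(mm, ip):
    """Read the counter for integer IP `ip`."""
    return struct.unpack_from("<I", mm, ip * COUNT_SIZE)[0]
```

With the mapping in place there are no explicit seeks or reads in the program. The OS pages the relevant parts of the file in and out as slots are touched, and `mm.flush()` followed by `mm.close()` pushes any dirty pages back to disk at the end.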

By caching the counts, you can control the amount of memory the process uses, whilst avoiding the one-seek-per-IP scenario, which is the killer for disk-based solutions.

By sorting the accumulated IPs before reading and writing large chunks, you make the best possible use of the system's file caching and the L1/L2/L3 caches, so that real memory (or disk) is only written to when necessary.



In reply to Re: Working with large amount of data by BrowserUk
in thread Working with large amount of data by just1fix
