in reply to Working with large amount of data

There is an implicit assumption in most of the responses in this thread that your talk of "IP addresses" is confined to IPv4 addresses. Is that a correct assumption?

I ask because, if it is, it becomes eminently feasible to consider using direct addressing into a 16GB file that holds a 32-bit count for each of the ~4 billion possible IPv4 addresses. That would require minimal memory, but reading, incrementing and writing 4 bytes within a 16GB file for each of 1 billion--essentially random--IPs discovered would be too costly to contemplate. And far more so if the reads and writes are to any kind of DB.

The secret to making this work efficiently would be to

  1. Memory map the 16GB file.
  2. Cache the IPs in memory as a sparse array (a hash in Perl's terms) as they are discovered, and only write them to disk once a given number of unique IPs have been accumulated.

    Once the cache limit is reached, you sort the unique IPs numerically and then read-update-write largish (say 64K) lumps of the file. You then reset (undef) the hash and continue; see the sketch below.
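
Something along these lines would be my starting point. This is an untested sketch only; File::Map is one of several mmap wrappers on CPAN, and the file name, cache limit and one-dotted-quad-per-line input format are just assumptions for illustration:

    #! perl -slw
    use strict;
    use File::Map qw( map_file );

    my $COUNTS_FILE = 'ipv4.counts';    ## 16GB: one 32-bit count per possible IPv4 address.
    my $CACHE_LIMIT = 1_000_000;        ## Flush once this many unique IPs are cached.

    ## The counts file must already exist at its full 16GB size; a sparse file
    ## created with truncate( $fh, 4 * 2**32 ) will do. Map the whole thing
    ## read/write and let the OS page bits of it in and out as required.
    map_file my $counts, $COUNTS_FILE, '+<';

    my %cache;                          ## Sparse in-memory counts: ip number => count.

    sub flushCache {
        ## Apply the cached counts in ascending IP order, so the mapped pages
        ## are dirtied in sequence rather than at random.
        for my $ip ( sort { $a <=> $b } keys %cache ) {
            vec( $counts, $ip, 32 ) += $cache{ $ip };
        }
        %cache = ();                    ## Reset the cache and carry on.
    }

    while( <> ) {
        chomp;
        ## Dotted-quad => 32-bit integer; that integer is the index into the file.
        my $ip = unpack 'N', pack 'C4', split /\./;
        ++$cache{ $ip };
        flushCache() if keys %cache >= $CACHE_LIMIT;
    }
    flushCache();                       ## Final, partial flush.

Very roughly, a million cached IPs costs a couple of hundred MB of hash, so $CACHE_LIMIT is the knob for trading memory against flush frequency.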

By using memory-mapped files, you let the virtual-to-physical address-mapping capabilities of the OS and hardware take care of accessing the appropriate chunk of the file efficiently.
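
If you do use File::Map for the mapping, it also exposes the underlying madvise/msync calls, so you can hint the access pattern and decide when dirty pages get pushed out rather than leaving everything to the defaults. Untested and strictly optional:

    use File::Map qw( map_file advise sync );

    map_file my $counts, 'ipv4.counts', '+<';
    advise $counts, 'random';    ## madvise: scattered accesses, so skip read-ahead.

    ## ...and after each cache flush, if you want the dirty pages written back
    ## now rather than whenever the OS gets around to it:
    sync $counts;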

By caching the counts, you control the amount of memory the process uses, whilst avoiding the one-seek-per-IP scenario that is the killer for disk-based solutions.
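
To put some (entirely illustrative) numbers on that killer, assuming a spinning disk that manages roughly 100 random IOs/second but streams at around 100MB/s:

    my $ips  = 1_000_000_000;
    my $seek = 0.010;                  ## ~10ms per random read-modify-write.
    printf "One seek per IP: ~%.0f days\n",    $ips * $seek / 86_400;    ## ~116 days

    my $file = 16 * 2**30;             ## The 16GB counts file.
    my $rate = 100 * 2**20;            ## ~100MB/s sequential.
    printf "One full pass  : ~%.1f minutes\n", $file / $rate / 60;       ## ~2.7 minutes

Even if every flush cost a complete sequential pass over the 16GB file (and with sorted, cached updates it will touch far less than that), a thousand flushes would still come in at around two days rather than nearly four months.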

By sorting the accumulated IPs before reading and writing large chunks, you make the best possible use of the system's file caching and the L1/L2/L3 caches, and only write to real memory (or disk) when necessary.
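
If mmap isn't an option, the same sorting makes the explicit read-update-write of 64K lumps from step 2 cheap, because each lump is read and written exactly once per flush. A possible (untested) version of that flush, given the counts file opened '+<' in binmode:

    use constant LUMP => 64 * 1024;             ## Bytes per lump = 16384 counters.

    sub flushCacheLumps {
        my( $fh, $cache ) = @_;                 ## Counts filehandle + ref to the IP => count hash.
        ## Bucket the cached IPs by the 64K lump their counter lives in.
        my %lumps;
        push @{ $lumps{ int( $_ / ( LUMP / 4 ) ) } }, $_ for keys %$cache;

        for my $lump ( sort { $a <=> $b } keys %lumps ) {
            sysseek $fh, $lump * LUMP, 0 or die $!;
            sysread $fh, my $buf, LUMP   or die $!;
            ## Bump every cached counter that falls inside this lump, then write
            ## the whole lump back in one go.
            vec( $buf, $_ % ( LUMP / 4 ), 32 ) += $cache->{ $_ } for @{ $lumps{ $lump } };
            sysseek $fh, $lump * LUMP, 0 or die $!;
            syswrite $fh, $buf           or die $!;
        }
        %$cache = ();
    }

Called as flushCacheLumps( $fh, \%cache ) in place of the mmap flush above.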

