in reply to Working with large amount of data
There is an implicit assumption in most of the responses in this thread that your talk of "IP addresses" is confined to IPv4 addresses. Is this a correct assumption?
I ask because, if it is, it becomes eminently feasible to consider using direct addressing into a 16GB file that holds a 32-bit count for each of the 2**32 (roughly 4.3 billion) possible IPv4 addresses. This would require minimal memory, but reading-incrementing-writing 4 bytes within a 16GB file for each of 1 billion (essentially random) IPs discovered would be too costly to contemplate. And far more so if the reads and writes go to any kind of DB.
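For concreteness, a rough Perl sketch of the arithmetic behind that direct addressing (the dotted-quad input format and the example IP are my assumptions, not anything stated above):

```perl
use strict;
use warnings;

# 2**32 possible IPv4 addresses x 4 bytes per 32-bit count = a 16GB file.
my $file_size = 4 * 2**32;                            # 17_179_869_184 bytes

# Each IP maps directly to a fixed 4-byte slot in that file.
my $ip     = '192.168.0.1';                           # example dotted quad
my $n      = unpack 'N', pack 'C4', split /\./, $ip;  # dotted quad -> 32-bit integer
my $offset = 4 * $n;                                  # byte offset of this IP's counter

printf "%s lives at offset %d of a %d byte file\n", $ip, $offset, $file_size;
```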
The secret to making this work efficiently would be to memory-map the file and to cache the counts in a hash until some preset limit is reached. Once that cache limit is reached, you sort the unique IPs numerically and then read-update-write largish (say 64K) lumps of the file. You then reset (undef) the hash and continue.
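A minimal sketch of that cache-and-flush loop, assuming the counts file already exists at its full 16GB size, that IPs arrive one per line on STDIN, and an arbitrary cache limit of one million distinct IPs (the filename, the limit, and the little-endian count format are all my choices, not anything specified above):

```perl
use strict;
use warnings;

use constant CACHE_LIMIT => 1_000_000;   # distinct IPs to hold before flushing (arbitrary)
use constant LUMP_SIZE   => 64 * 1024;   # read-update-write the file in 64K lumps

# Hypothetical pre-sized counts file. Created once (sparse on most filesystems), e.g.:
#   open my $out, '>', 'counts.bin' or die $!; truncate $out, 4 * 2**32; close $out;
my $file = 'counts.bin';
open my $fh, '+<:raw', $file or die "open: $!";

my %cache;

while ( my $ip = <STDIN> ) {
    chomp $ip;
    my $n = unpack 'N', pack 'C4', split /\./, $ip;   # dotted quad -> 32-bit integer
    ++$cache{$n};
    flush_cache( $fh, \%cache ) if keys %cache >= CACHE_LIMIT;
}
flush_cache( $fh, \%cache );
close $fh;

sub flush_cache {
    my ( $fh, $cache ) = @_;

    # Sort the cached IPs numerically so the file is walked in ascending order,
    # one 64K lump at a time.
    my @ips = sort { $a <=> $b } keys %$cache;

    while (@ips) {
        my $lump_start = int( $ips[0] * 4 / LUMP_SIZE ) * LUMP_SIZE;

        # Collect every cached IP whose counter lives inside this lump.
        my @in_lump;
        while ( @ips and $ips[0] * 4 < $lump_start + LUMP_SIZE ) {
            push @in_lump, shift @ips;
        }

        # Read the lump, bump its counters in memory, write it back.
        my $lump;
        sysseek $fh, $lump_start, 0   or die "seek: $!";
        sysread $fh, $lump, LUMP_SIZE or die "read: $!";
        for my $n (@in_lump) {
            my $pos   = $n * 4 - $lump_start;
            my $count = $cache->{$n} + unpack( 'V', substr( $lump, $pos, 4 ) );
            substr( $lump, $pos, 4 ) = pack 'V', $count;
        }
        sysseek  $fh, $lump_start, 0  or die "seek: $!";
        syswrite $fh, $lump           or die "write: $!";
    }

    %$cache = ();   # reset (undef) the hash and continue accumulating
}
```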
By using memory mapped files, you let the virtual/physical address mapping capabilities of the OS and hardware take care of accessing the appropriate chunk of the file efficiently.
By caching the counts, you can control the amount of memory the process uses whilst avoiding the one-seek-per-IP scenario that is the killer for disk-based solutions.
By sorting the accumulated IPs before reading and writing large chunks, you make the best possible use of the system's file caching and the L1/L2/L3 caches, only writing to real memory (or disk) when necessary.
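Pulling those three points together, an equally rough sketch of the memory-mapped variant. File::Map is my choice of CPAN module (none is named above), and the same hypothetical counts.bin file, STDIN input, and cache limit are assumed; the explicit lump handling disappears because the OS pages the mapped file in and out for you.

```perl
use strict;
use warnings;
use File::Map 'map_file';                # CPAN module; my choice, not named above

use constant CACHE_LIMIT => 1_000_000;   # arbitrary, as in the previous sketch

my $file = 'counts.bin';                 # same hypothetical pre-sized 16GB counts file
map_file my $map, $file, '+<';           # map the whole file read-write (needs a 64-bit perl)

my %cache;

while ( my $ip = <STDIN> ) {
    chomp $ip;
    ++$cache{ unpack 'N', pack 'C4', split /\./, $ip };
    flush_cache() if keys %cache >= CACHE_LIMIT;
}
flush_cache();

sub flush_cache {
    # The numeric sort gives ascending offsets, so each mapped page is touched
    # at most once per flush; the OS decides when dirty pages go back to disk.
    for my $n ( sort { $a <=> $b } keys %cache ) {
        my $pos   = $n * 4;
        my $count = $cache{$n} + unpack( 'V', substr( $map, $pos, 4 ) );
        substr( $map, $pos, 4, pack 'V', $count );
    }
    %cache = ();                         # reset (undef) the hash and continue
}
```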