in reply to Working with large amount of data

Use a good database that enforces unique-key constraints, and use the IP addresses as keys. You may want to divide the problem into separate tables according to the first portion of the address, because it's easy to manage 100 or so tables, and handling the hashing (checking uniqueness) will be easier/faster when there's a smaller number of entries in a given table.
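
Something along these lines is what I have in mind, as a rough sketch only, assuming SQLite through DBI (the table names, database file and log-parsing regex are placeholders, not a spec):

    use strict;
    use warnings;
    use DBI;

    # One small table per first octet; the PRIMARY KEY enforces uniqueness for us.
    my $dbh = DBI->connect( 'dbi:SQLite:dbname=ips.db', '', '',
        { RaiseError => 1, AutoCommit => 0 } );

    $dbh->do("CREATE TABLE IF NOT EXISTS ips_$_ (ip TEXT PRIMARY KEY)") for 0 .. 255;

    while ( my $line = <> ) {
        next unless $line =~ /\b((\d{1,3})\.\d{1,3}\.\d{1,3}\.\d{1,3})\b/;
        my ( $ip, $octet ) = ( $1, $2 );

        # INSERT OR IGNORE silently drops addresses this table has already seen.
        $dbh->do( "INSERT OR IGNORE INTO ips_$octet (ip) VALUES (?)", undef, $ip );
    }
    $dbh->commit;

With real volumes you would prepare the statement handles once rather than calling do() inside the loop, but the point stands: the database, not your script, keeps track of what has already been seen.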

(And with the database solution, you have the option of actually storing something informative about each IP address, and more flexibility in pulling stuff out of the list once you're done with the log file, in case that's helpful to you.)

UPDATE: CountZero's skepticism about my suggestions aside, I'll say now that I think my suggestions can be discarded in favor of the proposal described below by BrowserUK -- if his assumptions about the OP task are valid, his plan is profoundly more effective, efficient and satisfying.

Re^2: Working with large amount of data
by CountZero (Bishop) on Sep 21, 2009 at 06:17 UTC
    because it's easy to manage 100 or so tables, and handling the hashing (checking uniqueness) will be easier/faster when there's a smaller number of entries in a given table
    I do not agree.

    Running queries over data spread out across 100 tables is way more difficult (and slower) than running the same query on a single table.

    And isn't hashing supposed to work just as fast for large as for small address spaces? The hashing function should be O(1), if I remember correctly.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      Hashing is O(1), but the constant involves a seek to memory. When the hash is too big to fit in RAM, that turns into a seek to disk. On a typical hard drive you should plan on an average of 0.01 seconds for the drive head to be correctly positioned for your read.

      If you do a few billion lookups each taking 0.01 seconds, you're talking about a year (3 billion * 0.01 s = 3e7 seconds, or roughly 347 days). That's usually not acceptable.

      Splitting up the problem into a series of subproblems that fit in RAM is a huge performance win, despite the added complexity.
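
      A minimal sketch of that split, purely illustrative (the bucket-by-first-octet choice and the file names are assumptions, not the OP's actual layout):

          use strict;
          use warnings;

          # Pass 1: partition the log by first octet so each bucket fits in RAM.
          my %out;
          while ( my $line = <> ) {
              next unless $line =~ /\b((\d{1,3})\.\d{1,3}\.\d{1,3}\.\d{1,3})\b/;
              my ( $ip, $octet ) = ( $1, $2 );
              unless ( $out{$octet} ) {
                  open $out{$octet}, '>>', "bucket_$octet.txt"
                      or die "bucket_$octet.txt: $!";
              }
              print { $out{$octet} } "$ip\n";
          }
          close $_ for values %out;

          # Pass 2: each bucket is small enough for an ordinary in-memory hash.
          for my $bucket ( glob 'bucket_*.txt' ) {
              my %seen;
              open my $in, '<', $bucket or die "$bucket: $!";
              while ( my $ip = <$in> ) {
                  chomp $ip;
                  $seen{$ip} = 1;
              }
              close $in;
              print scalar( keys %seen ), " unique addresses in $bucket\n";
          }

      Each bucket gets read sequentially once, and the hash lookups stay in RAM instead of turning into disk seeks.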

        That is true, but you will have to (re)load each hash over and over again when you need access to one of the 99 hashes which happen not to be in memory at the moment you need it.

        I'd rather trust the designers of the database to have managed this in a well-optimized way than roll my own caching system.

        I think 1 billion records with over 1 TB of data simply cries out for a professional database.
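
        For what it's worth, a minimal single-table sketch of what I mean (SQLite via DBI only as an example; any serious RDBMS will do, and the names are just placeholders):

            use strict;
            use warnings;
            use DBI;

            # One table, one unique index: the database does all the bookkeeping.
            my $dbh = DBI->connect( 'dbi:SQLite:dbname=ips.db', '', '',
                { RaiseError => 1, AutoCommit => 0 } );
            $dbh->do('CREATE TABLE IF NOT EXISTS ips (ip TEXT PRIMARY KEY)');

            my $sth = $dbh->prepare('INSERT OR IGNORE INTO ips (ip) VALUES (?)');
            while ( my $line = <> ) {
                $sth->execute($1) if $line =~ /\b(\d{1,3}(?:\.\d{1,3}){3})\b/;
            }
            $dbh->commit;

            my ($count) = $dbh->selectrow_array('SELECT COUNT(*) FROM ips');
            print "$count unique IP addresses\n";

        Keeping AutoCommit off and committing once at the end is what keeps the inserts fast; the rest is the database's problem, as it should be.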

        CountZero
