in reply to Scaling Hash Limits

55 million items in your hash? One wonders where all this data is coming from. It will take quite a while just to read in that many items.

If you have to redo the duplicate check several times (say, every time new data is added to the set), perhaps now is the moment to store the data in a database and use the database engine as your duplicate detector.

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

My blog: Imperial Deltronics

Replies are listed 'Best First'.
Re^2: Scaling Hash Limits
by Endless (Beadle) on Sep 19, 2013 at 13:32 UTC
    I'm going through a stock of twitter messages. There are around 180 million total, and I want to make sure I don't have duplicates (based on 12-digit id numbers).
      12-digit id numbers

      Do you know the approximate lowest and highest numbers?

      If so, you can reduce the memory requirement for de-duping your 180 million ids from ~ 4 gigabytes to ~ 8 megabytes.

      And make it much faster also.
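      BrowserUk doesn't spell the approach out here, but the usual Perl idiom for this kind of de-duplication is a bit vector built with vec(): one bit per possible ID, offset by the lowest ID in use. A minimal sketch, assuming a (hypothetical) lower bound and range for the IDs:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Bit-vector de-duplication sketch. The lowest ID and the range are
# assumptions for illustration, not facts about the actual Twitter IDs.
my $lowest_id = 100_000_000_000;    # hypothetical lowest 12-digit ID seen
my $range     = 250_000_000;        # hypothetical span of issued IDs

my $seen = '';
vec( $seen, $range - 1, 1 ) = 0;    # pre-extend: $range bits, ~30 MB

for my $id ( 100_000_000_007, 100_000_000_042, 100_000_000_007 ) {
    my $offset = $id - $lowest_id;  # map the ID into the bit vector
    if ( vec( $seen, $offset, 1 ) ) {
        print "duplicate: $id\n";   # prints "duplicate: 100000000007"
    }
    else {
        vec( $seen, $offset, 1 ) = 1;
    }
}
```

      Because the whole structure is one Perl string, there is no per-key hash overhead: membership tests are a single bit probe, which is where the speed-up over a 180-million-key hash comes from.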


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        Hi BrowserUk, you're making a very interesting point, which nobody seems to have picked up so far. I can see two ways of significantly reducing the memory required for the task at hand. First, we probably don't need to worry about full 12-digit numbers if we can narrow the range of IDs actually in use down to something of the order of, say, 200 to 250 million, which seems a reasonable hypothesis if the IDs are allocated sequentially by the system. Second, once we have such a narrower range, we basically need only one bit per number in the range: we just set that bit once a particular number has been seen. These two observations could bring the memory requirement down to perhaps 30 megabytes, if I can still think clearly this late in the evening, but I can't see how to reduce it to only 8 megabytes. I would be grateful if you could enlighten me on this, and I am sure many others would benefit from these ideas.

        Update: corrected my error of inattention: BrowserUk said 8 megabytes, not 4 megabytes.
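        The arithmetic behind the figures traded in this subthread can be checked in a few lines. The 250-million range is laurent_r's hypothesis, and the ~24 bytes per hash entry is only one assumption that would explain BrowserUk's ~4 GB figure (real Perl hash overhead varies by version and build):

```perl
use strict;
use warnings;

# One bit per possible ID across the (hypothesised) 250-million range:
my $range_bits = 250_000_000;
printf "bit vector: ~%.1f MB\n",
    $range_bits / 8 / 2**20;        # prints "bit vector: ~29.8 MB"

# A hash of 180 million keys at an assumed ~24 bytes per entry:
printf "hash: ~%.1f GB\n",
    180_000_000 * 24 / 2**30;       # prints "hash: ~4.0 GB"
```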

      Database.