Divide and conquer. You say you have two fields that make up the key, one of which is unique. Take the right-hand digit of the number and sort your records into 10 files by that digit. (Insert hand waving about it probably working out that you end up with roughly even-sized output files.) Now do your dupe checks on the resulting files. The thing to remember about Perl hashes is that they grow in powers of two; that is, they double when they get too small. So divide your file sufficiently that you stay within reasonable bounds. Dividing by 10 has worked for me with equivalent-sized data loads.
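Here is a rough sketch of the split-then-check idea, in case it helps. It assumes a comma-separated record with the account number in the first field and the second key field right after it; that layout, the file names and the sub names (partition_by_last_digit, find_dupes) are all made up for illustration, so adjust to your real data.

```perl
use strict;
use warnings;

# ASSUMPTION: one record per line, "account_number,other_key_field,...".
# Split $in_file into 10 bucket files named "$prefix.0" .. "$prefix.9",
# keyed on the last digit of the account number.
sub partition_by_last_digit {
    my ($in_file, $prefix) = @_;
    my @out;
    for my $d (0 .. 9) {
        open my $fh, '>', "$prefix.$d" or die "Cannot write $prefix.$d: $!";
        push @out, $fh;
    }
    open my $in, '<', $in_file or die "Cannot read $in_file: $!";
    while (my $line = <$in>) {
        my ($acct) = split /,/, $line, 2;
        # Skip lines whose first field does not end in a digit.
        next unless defined $acct && $acct =~ /([0-9])\s*$/;
        print { $out[$1] } $line;
    }
    close $in;
    for my $fh (@out) {
        close $fh or die "close failed: $!";
    }
    return map { "$prefix.$_" } 0 .. 9;
}

# Return every record in $file whose two-field key was already seen.
# The hash only has to hold one bucket's worth of keys at a time.
sub find_dupes {
    my ($file) = @_;
    my (%seen, @dupes);
    open my $in, '<', $file or die "Cannot read $file: $!";
    while (my $line = <$in>) {
        chomp $line;
        my ($acct, $other) = split /,/, $line;
        my $key = join "\0", $acct, (defined $other ? $other : '');
        push @dupes, $line if $seen{$key}++;
    }
    close $in;
    return @dupes;
}
```

You would call partition_by_last_digit once, then run find_dupes over each returned file in turn, so the %seen hash is thrown away between buckets. If 10 buckets still leaves the hashes too big, keying on the last two digits gives 100 buckets by the same logic.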

There are other approaches to this, like using DB_File or some kind of RDBMS, but I actually think you will have a simpler and probably more efficient system overall if you just use some kind of approach to scale the data down. Splitting data into bite-sized chunks is an ancient and honorable programming tradition. :-)

Oh, another approach is to use a trie of some sort. If your account numbers are dense, it can be a big winner overall in terms of space, and it is very efficient in terms of lookup.
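To show what I mean by a trie, here is a minimal sketch using nested arrays indexed by digit. It assumes the keys are plain strings of ASCII digits, and the sub names (trie_insert, trie_contains) are just mine. Take it as an illustration of the structure only: nested Perl arrays carry their own per-node overhead, so a real space win probably needs a more compact representation than this.

```perl
use strict;
use warnings;

# Each node is an array ref: slots 0..9 hold child nodes for that digit,
# slot 10 is an end-of-key marker. ASSUMPTION: keys are all-digit strings.

# Insert $key; returns 1 if it was already present (a dupe), 0 otherwise.
sub trie_insert {
    my ($root, $key) = @_;
    die "key must be all digits: $key" unless $key =~ /\A[0-9]+\z/;
    my $node = $root;
    for my $digit (split //, $key) {
        $node = $node->[$digit] ||= [];   # create the child on first use
    }
    my $seen = $node->[10] ? 1 : 0;
    $node->[10] = 1;
    return $seen;
}

# Returns 1 if $key was inserted earlier, 0 otherwise. Never adds nodes.
sub trie_contains {
    my ($root, $key) = @_;
    return 0 unless $key =~ /\A[0-9]+\z/;
    my $node = $root;
    for my $digit (split //, $key) {
        $node = $node->[$digit] or return 0;
    }
    return $node->[10] ? 1 : 0;
}
```

Usage is just my $root = []; and then a dupe check becomes if (trie_insert($root, $acct)) { ... }. Keys that share a prefix share the nodes for that prefix, which is where the saving for dense account numbers comes from.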


---
demerphq

    First they ignore you, then they laugh at you, then they fight you, then you win.
    -- Gandhi



In reply to Re: Bloom::Filter Usage by demerphq
in thread Bloom::Filter Usage by jreades
