sdbarker has asked for the wisdom of the Perl Monks concerning the following question:

Alright, here's the skinny:

I have to generate a couple hundred million unique codes that are 16 digits in length, upper-alphanumeric (insert lots of criteria here). That entire process works perfectly fine, with one caveat: I run out of memory after I create about 34 million.

All of these codes are being stored in memory, in a hash, to maintain uniqueness. Is there any other method I could possibly use to maintain uniqueness without having to store them all in memory? I've considered using a database backend and making the database maintain the uniqueness for me. I haven't implemented that yet, though it seems as though it'd be significantly slower, and speed (since I'm generating 600 million of these) is quite key.

UPDATE:
The reason that I'm not going sequentially or algorithmically is that these codes are going to be printed on a product, then collected and redeemed by consumers of that product for various things. With a sequential or algorithmic scheme, a consumer could redeem codes for products they hadn't purchased, simply by using the next incremental number or by figuring out the algorithm.

So what I've done is over-generate by a few million; I'm in the process of tossing them into a database, and I'll either delete what I don't need or just let them sit there until later. Thanks, everybody. Though I'm still open to suggestions. :)

So, I'm open to ANY suggestions that anybody may have.

-Scott

Re: Maintaining uniqueness across millions of items
by dragonchild (Archbishop) on Aug 13, 2004 at 12:11 UTC
    it seems as though ...

    Back-of-a-napkin benchmarks are, well, worth about the same as a used napkin. I would suggest grabbing MySQL or PostgreSQL and trying it out. An RDBMS would certainly be the solution I would choose, given what you have described of the problem. (Not to mention that, at best, you're looking at needing 16 x 600M = 9.6B bytes, or 9.6 gigabytes, even without any overhead.)
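
    For illustration, a rough and untested DBI sketch of "let the database maintain the uniqueness": the connection details, database name, and table name below are placeholders, and the PRIMARY KEY on the code column is what actually rejects duplicates.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use DBI;

        # Placeholder connection details -- adjust for your RDBMS of choice.
        my $dbh = DBI->connect( 'DBI:mysql:database=codes', 'user', 'password',
            { RaiseError => 1, PrintError => 0 } );

        $dbh->do('CREATE TABLE IF NOT EXISTS code (code CHAR(16) NOT NULL PRIMARY KEY)');
        my $ins = $dbh->prepare('INSERT INTO code (code) VALUES (?)');

        my @chars = ( '0' .. '9', 'A' .. 'Z' );
        for ( 1 .. 1_000 ) {    # in reality, hundreds of millions, preferably batched
            my $code = join '', map { $chars[ rand @chars ] } 1 .. 16;
            # A duplicate key makes execute() die; just generate another code.
            eval { $ins->execute($code); 1 } or redo;
        }
        $dbh->disconnect;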

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

    I shouldn't have to say this, but any code, unless otherwise stated, is untested

Re: Maintaining uniqueness across millions of items
by davorg (Chancellor) on Aug 13, 2004 at 12:19 UTC

    I'm probably missing something here, but is there any reason why you can't just allocate sequential numbers? That way you'd only need to store the next available value.

    --
    <http://www.dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

Re: Maintaining uniqueness across millions of items
by adrianh (Chancellor) on Aug 13, 2004 at 12:25 UTC

    What are your other constraints? I presume there must be a reason that you're not starting at 1 and counting upwards :-)

    Personally I use Data::UUID when I need unique IDs these days.
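
    A minimal usage sketch, for what it's worth -- though note that Data::UUID produces 32 hex digits (plus hyphens), not the 16-character upper-alphanumeric format asked for above:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Data::UUID;

        # UUIDs are unique by construction, so no lookup table is needed,
        # but the format doesn't match the 16-character requirement.
        my $ug = Data::UUID->new;
        print $ug->create_str, "\n" for 1 .. 3;
        # e.g. 0E1276B8-3CD7-11EE-8C90-0242AC120002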

Re: Maintaining uniqueness across millions of items
by simon.proctor (Vicar) on Aug 13, 2004 at 12:26 UTC
    As an alternative to the suggestions above, you could also consider a DBM file solution -- specifically Berkeley DB from Sleepycat. It operates on the same key/value model you're already using and scales well.

    There is a CPAN module here so YMMV (look at the tests).
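
    A rough, untested sketch of the tied-hash version using DB_File (one of the Berkeley DB bindings); the filename is arbitrary:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Fcntl;
        use DB_File;

        # Tie the "seen" hash to an on-disk Berkeley DB file, so the
        # uniqueness check is no longer limited by RAM (at the cost of
        # a disk lookup per code).
        my %seen;
        tie %seen, 'DB_File', 'codes.db', O_CREAT | O_RDWR, 0644, $DB_HASH
            or die "Cannot tie codes.db: $!";

        my @chars = ( '0' .. '9', 'A' .. 'Z' );
        for ( 1 .. 1_000 ) {    # in reality, hundreds of millions
            my $code;
            do {
                $code = join '', map { $chars[ rand @chars ] } 1 .. 16;
            } while ( exists $seen{$code} );
            $seen{$code} = 1;
        }
        untie %seen;
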
Re: Maintaining uniqueness across millions of items
by water (Deacon) on Aug 13, 2004 at 13:19 UTC
    If you can get away with "probably unique", you can rely on statistics: when you randomly select, with replacement, a small number of elements from a very large set, you can work out the likelihood that all of them are unique (n!/((n-k)! * n^k) for k pulls from a set of n, which is roughly 1 - k(k-1)/(2n) when k is tiny compared to n).
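
    As a back-of-the-envelope check for this particular case (assuming the full 36^16 space of 16-character upper-alphanumeric codes is usable):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Birthday-problem estimate: chance of at least one collision when
        # drawing k random codes from a space of n.  Here n = 36**16 is
        # about 7.9e24, so 600 million draws are almost certainly all distinct.
        my $n = 36**16;
        my $k = 600_000_000;
        printf "P(any collision) ~ %.1e\n", 1 - exp( -$k * ( $k - 1 ) / ( 2 * $n ) );
        # prints something like 2.3e-08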

    If you can't get away with "probably", you need to use something sequential (1, 2, 3, ...) or, if you're concerned that the numbers shouldn't look sequential, some invertible function of something sequential (f(1), f(2), f(3), ...). (Security-through-obscurity -- beware.)
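
    One illustrative (untested) way to do the "invertible function of something sequential" trick: push a plain counter through an affine map modulo 36**16 and encode the result in base 36. The multiplier and offset below are arbitrary; the only requirement is that the multiplier be coprime to 36**16 (divisible by neither 2 nor 3). As noted above, this is obscurity rather than security -- a keyed cipher would be stronger.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Math::BigInt;

        my $n = Math::BigInt->new(36)->bpow(16);              # size of the code space
        my $a = Math::BigInt->new('123456789012345678901');   # odd, not a multiple of 3
        my $b = Math::BigInt->new('987654321098765432109');   # arbitrary offset
        my @chars = ( '0' .. '9', 'A' .. 'Z' );

        sub code_for {
            my ($i) = @_;                        # sequential counter: 0, 1, 2, ...
            my $x = ( $a * $i + $b ) % $n;       # invertible permutation of 0 .. n-1
            my $code = '';
            for ( 1 .. 16 ) {
                my ( undef, $r ) = $x->bdiv(36); # bdiv leaves the quotient in $x
                $code = $chars[$r] . $code;      # and also returns the remainder
            }
            return $code;
        }

        print code_for($_), "\n" for 0 .. 4;     # distinct inputs, distinct codes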

    Otherwise, you're stuck with remembering what you've already said: you need a database.

    An aside -- there are special approaches for storing "archival" data, where one only needs insert and select, not delete or update. Here's one random link to this idea.

Re: Maintaining uniqueness across millions of items
by Random_Walk (Prior) on Aug 13, 2004 at 12:20 UTC
    How about splitting them into groups based on the first character? With 26+10 = 36 possible first characters, you only have to generate about 600/36 ≈ 17 million codes for each initial character, and you can write each group out before moving on to the next one. I guess randomness is important here, as you are not just generating the sequence 1..600M, so when you hand them out, first randomly select which block to take one from, then select a random one from that block (and mark it as used).
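
    A rough, untested sketch of that bucketing idea -- the uniqueness hash is reset for each first character, so it never holds more than about 17 million entries:

        #!/usr/bin/perl
        use strict;
        use warnings;

        my @chars = ( '0' .. '9', 'A' .. 'Z' );     # upper-alphanumeric
        my $total = 600_000_000;
        my $per   = int( $total / @chars ) + 1;     # codes per bucket

        for my $first (@chars) {
            my %seen;                               # emptied per bucket => bounded memory
            open my $out, '>', "codes_$first.txt" or die "codes_$first.txt: $!";
            while ( keys %seen < $per ) {
                my $code = $first . join '', map { $chars[ rand @chars ] } 1 .. 15;
                next if $seen{$code}++;             # already generated in this bucket
                print {$out} $code, "\n";
            }
            close $out or die "codes_$first.txt: $!";
        }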

    Cheers.

Re: Maintaining uniqueness across millions of items
by deibyz (Hermit) on Aug 13, 2004 at 12:54 UTC
    I have to generate a couple hundred million unique codes, that are 16 digits in length, upper-alphanumeric, (insert lots of criteria here).

    Depending on how complex the criteria are, I would recommend you find an algorithm that generates all of them without repetition. If you post the criteria here, I'm sure somebody will be able to help you.

    deibyz

Re: Maintaining uniqueness across millions of items
by Gilimanjaro (Hermit) on Aug 13, 2004 at 13:41 UTC
    First of all, you're talking about 600 million pieces of information, which amounts to 'large amounts of data', and databases are intended to maintain 'large amounts of data', so I'd recommend a database backend.

    Second of all, how do you generate the id's? If they can be sequential, 6 digits (0-9 + A-Z) would give you over three times the number of id's you'd need.
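
    For illustration (untested), a plain counter converted to a zero-padded base-36 string -- though a straight counter does bring back the guessability problem from the update above:

        #!/usr/bin/perl
        use strict;
        use warnings;

        my @chars = ( '0' .. '9', 'A' .. 'Z' );

        # 36**6 is about 2.18 billion, so 6 characters comfortably cover
        # 600 million sequential IDs.
        sub to_base36 {
            my ( $n, $width ) = @_;
            my $s = '';
            do {
                $s = $chars[ $n % 36 ] . $s;
                $n = int( $n / 36 );
            } while ($n);
            return ( '0' x ( $width - length $s ) ) . $s;   # left-pad with zeroes
        }

        print to_base36( $_, 6 ), "\n" for 0 .. 2, 599_999_999;
        # 000000, 000001, 000002, 9X82YN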

    So I guess it depends on what other criteria the id's need to adhere to. Let us know; we're dying to find out... :)
Re: Maintaining uniqueness across millions of items
by ChuckularOne (Prior) on Aug 13, 2004 at 13:03 UTC
    There is always the possibility that you could generate the numbers and create a checksum that is significantly smaller than 16 digits.

    Then you would only have to compare the checksums (and hold them in memory) for uniqueness.

    The only caveat is that, in a list of 600,000,000 keys, you will most likely have some codes/numbers which are unique but share the same checksum.
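
    A minimal, untested sketch of that idea, keeping only a truncated MD5 digest per code. (Worth noting: a Perl hash's per-key overhead dwarfs the 8 bytes saved over storing the code itself, so this would need a packed or on-disk structure to really pay off.)

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Digest::MD5 qw(md5);

        my %seen_digest;
        my @chars = ( '0' .. '9', 'A' .. 'Z' );

        for ( 1 .. 1_000 ) {    # in reality, hundreds of millions
            my $code   = join '', map { $chars[ rand @chars ] } 1 .. 16;
            my $digest = substr md5($code), 0, 8;   # keep 8 of MD5's 16 raw bytes
            next if $seen_digest{$digest}++;        # may discard a genuinely new code,
            print $code, "\n";                      # but never lets a duplicate through
        }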