in reply to How likely is rand() to repeat?

Mathematically, the odds of any given pick duplicating one particular earlier pick are about 1 in 6.45e44 (62**25 possible 25-character IDs drawn from a 62-character set), which is vanishingly unlikely. Obviously, the more you pick, the more likely a duplicate becomes; but given those starting odds, even if you are picking trillions, a duplicate is still very unlikely.

More precisely, the birthday-paradox approximation puts the probability of at least one collision among n picks from N possibilities at roughly n**2 / (2*N). For n = 1e12 picks that is 1e24 / (2 * 6.45e44) ~= 1/1.29e21 -- still roughly one chance in a sextillion.
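
Both figures are easy to check with a couple of one-liners (the 1e12 here is just the example pick count from above):

  perl -le 'print 62**25'                      # ~6.45e44 possible 25-char IDs
  perl -le 'print 1e12**2 / ( 2 * 62**25 )'    # ~7.75e-22, i.e. ~1/1.29e21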

Of course, it also depends somewhat upon the quality of your rand(); but even using the notoriously poor rand() built into Win32 perl, I didn't get a single dup in 1e7 picks:

my @chars = ( 'a' .. 'z', 'A' .. 'Z', '0' .. '9' );   # the 62-character set (62**25 ~= 6.45e44)
undef %h;
++$h{ join '', map $chars[ rand @chars ], 1 .. 25 } for 1 .. 1e7;
print scalar keys %h;   # prints 10000000 -- all 1e7 picks unique

You could use a better rand, like Math::Random::MT, but it is probably unnecessary unless you are picking trillions. It is also much slower than the built-in on my machine.
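
If you did want to try it, a minimal sketch using Math::Random::MT's OO interface might look like this (the seed shown is illustrative only; for serious use you would want a much better-seeded generator, as discussed further down the thread):

  use Math::Random::MT;

  my @chars = ( 'a' .. 'z', 'A' .. 'Z', '0' .. '9' );
  my $mt    = Math::Random::MT->new( time() ^ $$ );   # illustrative seed only
  # rand(N) returns a float in [0, N); the array index truncates it.
  print join '', map $chars[ $mt->rand( scalar @chars ) ], 1 .. 25;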

Finally, no matter how good your rand(), those vanishingly small odds are no guarantee that you won't get a duplicate. It can still happen; you are just very, very unlikely to see it happen twice in your lifetime.


Re^2: How likely is rand() to repeat?
by desertrat (Sexton) on Mar 08, 2012 at 21:50 UTC

    OK, the numbers will likely never even get into the millions, so I'm probably good -- good enough that I (or rather my users) can live with the one DB error in my lifetime...

    Or I can belt-and-suspenders it and re-run the key-creating function if my DB throws a constraint violation...
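
    For what it's worth, a sketch of that retry loop -- assuming DBI with RaiseError enabled, a hypothetical make_key() generator, and a UNIQUE constraint on the id column (all names here are illustrative):

      for my $try ( 1 .. 5 ) {
          my $key = make_key();                    # hypothetical existing generator
          my $ok = eval {
              $dbh->do( 'INSERT INTO items (id) VALUES (?)', undef, $key );
              1;
          };
          last if $ok;                             # inserted cleanly, no violation
          die "no unique key after $try tries: $@" if $try == 5;
      }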

    The OS in question is 64-bit CentOS 6, for what that's worth.

      Or run this till the cows come home.

      perl -MData::UUID -le 'print Data::UUID->new->create_str'

      Or write a sequence generator.
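
      One possible shape for such a generator -- a minimal sketch assuming a writable counter file (the path is illustrative) and flock() for mutual exclusion:

        use Fcntl qw( :flock :seek );

        sub next_id {
            my $file = '/var/tmp/id.seq';                   # illustrative path
            open my $fh, '+>>', $file or die "open: $!";    # read/append; creates if missing
            flock $fh, LOCK_EX or die "flock: $!";
            seek $fh, 0, SEEK_SET;
            my $n = ( <$fh> // 0 ) + 1;                     # read last value, increment
            seek $fh, 0, SEEK_SET;
            truncate $fh, 0 or die "truncate: $!";
            print {$fh} "$n\n";                             # write the new value back
            close $fh or die "close: $!";
            return $n;
        }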

      Update: for less predictability when recreating:

      perl -MData::UUID -le 'print Data::UUID->new->create_from_name_str("tablename?", rand())'
      "Or I can belt-and-suspenders it and re-run the key-creating function if my DB throws a constraint violation..."

      Indeed a wise precaution. If performance isn't an issue, using Math::Random::MT is another.



        "Just use a Mersenne Twister" doesn't solve the problem by itself. Of course, if all of the IDs are going to be generated by a single, long-running process then the typical MT would give one just under 20K bits of random state (which would solve the problem). But that seems quite unlikely, especially because such an arrangement would allow for a much simpler assignment of unique IDs.

        To solve the problem, you would (probably) have to "use a Mersenne Twister" and come up with a lot more bits of randomness for seed data each time a new process creates its first ID (and use all those random bits to initialize the MT).

        Now, "just use an MT" does help some if each process will be generating a lot of IDs. If, for example, you can convince yourself that only 2**-10 of the IDs will be "the first ID generated" by any given process, then the MT has, in effect, given you 10 more bits of randomness to help in avoiding ID collisions (up to the ~20k bits of state for the usual MT).

        So, a typical Perl build, whose rand() is effectively limited to roughly 2**31 distinct seeds, means base-62 IDs longer than 6 characters are just a waste. If we assume that Perl's own srand() actually succeeds at picking seeds in a way that effectively covers that ~31-bit seed space (likely true on a system with /dev/urandom and a modern Perl, less so otherwise), and we use long-running Perl interpreter instances so that only 2**-10 of the IDs are "first" IDs, then those 10 extra bits move us from 5.2-character IDs to 6.9-character IDs. So it would only be a waste if we produced IDs longer than 7 characters -- nowhere near making 25-character IDs effective.
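
        As a quick check of those character counts (bits divided by log2(62) ~= 5.95 gives base-62 characters):

          perl -le 'printf "%.1f\n", $_ / ( log(62) / log(2) ) for 31, 41'

        which prints 5.2 and 6.9.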

        Coming up with reliably random bits is a harder problem for computers, one that is covered by /dev/random. So one could use /dev/random to get 20K (or at least 149 -- 25 base-62 characters encode about 149 bits) random bits to initialize each MT instance in order to solve the problem (if /dev/random is available). I suspect even getting the bits from /dev/urandom would solve this problem (as I think using just /dev/urandom directly to build the IDs would, in the worst case, act like using a single, shared MT instance).
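
        For illustration, a sketch of that full initialization -- assuming Math::Random::MT's new() accepts a list of 32-bit seeds (as its documentation describes) and that /dev/urandom is available:

          use Math::Random::MT;

          # 624 32-bit words cover the full 19937-bit MT19937 state.
          open my $ur, '<', '/dev/urandom' or die "open: $!";
          read( $ur, my $buf, 624 * 4 ) == 624 * 4 or die "short read: $!";
          close $ur;

          my $mt = Math::Random::MT->new( unpack 'L*', $buf );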

        Or you could use your own persistent store like a file of "random" bytes that you carefully update each time you read seed data from it. Two processes reading the same seed data could, of course, present a problem. Blocking process initialization waiting for the seed file to be updated in single-file might present problems for scaling such a solution. So there are some minor challenges to be aware of. Or create a service to fetch IDs from.

        - tye