in reply to How likely is rand() to repeat?

Well, there are 6225 == 645345427773512447880377451634304602899218432 different strings of length 25, over an alphabet of 62 characters. If you assume the rand() function was perfect, you want to solve the birthday paradox for a body that has 645345427773512447880377451634304602899218432 days in a year. If you generate N strings, a rough estimation of the probability that all of them are different is: e-N2/1290690855547024895760754903268609205798436864

But that's all irrelevant. Your random number generator is likely to only have 24, 32, 48 or 64 bits of entropy. Which means that most of the possible strings will never be generated. Even with 64 bits of entropy, there are only 18446744073709551616 strings possible. Which makes is far more likely for collisions to happen. In fact, Wikipedia tells us that with a 64 bit hash, there's a one procent chance you have a collision if you have 6.1E8 strings.

For more details, see MathWorld.

Replies are listed 'Best First'.
Re^2: How likely is rand() to repeat?
by BrowserUk (Patriarch) on Mar 08, 2012 at 23:55 UTC
    But that's all irrelevant. Your random number generator is likely to only have 24, 32, 48 or 64 bits of entropy. Which means that most of the possible strings will never be generated. Even with 64 bits of entropy, there are only 18446744073709551616 strings possible. Which makes is far more likely for collisions to happen.

    That would only be the case if he was only calling rand() once to pick from the 6.45e44 strings. But he is calling rand 25 times. Which as each pick is 'independant' of the previous and next picks, it increases the effective entropy.

    By way of proof. If you throw a die once, you can only get 1 .. 6; but throw it twice and you can get 1 .. 36.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      No. Perl's rand() is documented to be completely repeatable when given the same starting point (see srand). So the number of strings possible using just rand() is bound by roughly 2 ** (the number random bits) or, more correctly, the number of effectively different srand() values.

      Update: And every implementation that I've seen uses a simple LCPRNG (Linear Congruential Psuedo-Random Number Generator). So each has a fixed number of possible values and it will simply cycle through those in the exact same order each time if you actually call it 2**$bits times. The number of effectively different seed values is called the "period".

      Update: Oh, forgot to mention that some implementations intentionally use only a subset of the (upper) bits so the period can often be much higher than 2**$bits. For example, a period of 2**32 is quite common while many such implementations only use 16 bits of each result. And I'm not completely sure, in such a case, whether Perl Config reports that as 16 or 32 "random bits".

      Note: Just FYI, the above two updates were posted within a few minutes of the original submission (and thus nearly 2 hours before the first reply).

      - tye        

        So each has a fixed number of possible values and it will simply cycle through those in the exact same order each time if you actually call it 2**$bits times. The number of effectively different seed values is called the "period".

        Sorry tye, but you are very wrong on this, and I can prove it.

        The "period" is the number of values you have to draw before the entire sequence of 2**n-bits values, repeat in the same sequence, in their entirety!

        The Mersenne Twister MT19937, with 32-bit word length, has a period of 219937 - 1. How could that be if your statement above was true?

        Of course, that is not a LCPRNG.

        But, the rand() provided by MSVC is a linear congruential PRNG. And it only produce 15-bits on entropy. Proof:

        C:\test>perl -E"++$h{ int( rand 65536 ) } for 1 .. 1e6; say scalar key +s %h; grep{ $_ & 1 } keys %h or say 'No odd numbers found'" 32768 No odd numbers found

        And for the whole sequence to repeat, the first two values would have to repeat one after the other first. And according to you, that should happen within the first 32768 values produced.

        It doesn't. (With srand(1)):

        C:\test>randperiod -M=2 41 18467 i:1 n:412286284 First sequence of 2 values repeated itself after 412286284 calls to ra +nd

        That's 412 million values generated before the first pair are repeated in sequence.

        Now let's try to match the first 3 values produced:

        C:\test>randperiod -M=3 41 18467 6334 i:2 n:2147418117 First sequence of 3 values repeated itself after 2147418117 calls to r +and

        That's 2 billion values drawn before the first 3 values repeat in sequence.

        The test code:

        #! perl -slw use strict; sub rand32768{ int( rand 32768 ) } $|++; our $M //= 10; srand( 1 ); my @first = map rand32768(), 1 .. $M; print "@first"; my $n = $M; OUTER: while( 1 ) { ++$n until rand32768 == $first[ 0 ]; for my $i ( 1 .. $M - 1 ) { ++$n; printf "\ri:$i n:$n"; redo OUTER unless rand32768() == $first[ $i ]; } last; } print "\nFirst sequence of $M values repeated itself after $n calls to + rand";

        I've tweaked the code to log how many value were drawn for each increasing length sequence. I'll leave it running over night.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        The start of some sanity?

      Which as each pick is 'independant' of the previous and next picks, it increases the effective entropy.
      But that's exactly the problem. As I indicated, if rand() would be perfect, it will pick each of the 645345427773512447880377451634304602899218432 different possible string with an equal chance.

      But with most implementations of rand the picks are not independant; the sequence of return values of rand is completely determined by the value of the seed. If srand takes a 64 bit number as argument, then there are at most 264 sequences of return values of rand() possible. But that requires an excellent implementation of rand(), and you got to be lucky enough the mapping to 0..61 doesn't provide duplicates.

        Let's say we were on a 2 bit processor and we had a 2-bit PRNG. There could only be 22 starting points (seeds).

        But the (non-repeating) sequences it could produce are any permutation of the following 24 permutations of the 4 basic values it can produce:

        {0, 1, 2, 3} | {0, 1, 3, 2} | {0, 2, 1, 3} | {0, 2, 3, 1} | {0, 3, 1, 2} | {0, 3, 2, 1} | {1, 0, 2, 3} | {1, 0, 3, 2} | {1, 2, 0, 3} | {1, 2, 3, 0} | {1, 3, 0, 2} | {1, 3, 2, 0} | {2, 0, 1, 3} | {2, 0, 3, 1} | {2, 1, 0, 3} | {2, 1, 3, 0} | {2, 3, 0, 1} | {2, 3, 1, 0} | {3, 0, 1, 2} | {3, 0, 2, 1} | {3, 1, 0, 2} | {3, 1, 2, 0} | {3, 2, 0, 1} | {3, 2, 1, 0}

        Hence, the 32-bit, Mersenne Twister MT19937 can produce 219937 - 1 values (from any given starting point) before it repeats itself exactly.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        The start of some sanity?