Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

RE: RandomFile

by jlistf (Monk)
on Aug 03, 2000 at 23:31 UTC ( [id://26069]=note: print w/replies, xml ) Need Help??


in reply to RandomFile

the first part of this was already solved in another reply. as for the second part... it depends on how you want to implement it. you COULD start at another place at random. you COULD start working backwards. but i think for a truly random selection, you should probably start over at the very beginning of the file.

if this needs to be explained, let me know.

jeff

Replies are listed 'Best First'.
RE: RE: RandomFile
by jettero (Monsignor) on Aug 04, 2000 at 21:11 UTC
    yeah ... definitely needs to be explained. ;)
      here we go:

      lets say we are searching for a 'random' word with a certain length. first assumption: these n-length words are uniformly distributed throughout the file. if this is the case, then chosing a word at random and moving forward till you find a n-length word will work. but this is a dictionary file... so you only want words at the beginning of entries. so you 'randomly' choose entries to find n-length words. allright... what happens if you get to the end of a file and you haven't found a proper length word. you have some options: 1) go to the beginning of the file and start looking. 2) work your way backwards in the file. 3) randomly choose another starting point. option 1 is the best option, and here's why: if you don't go back to the very beginning and start looking from there, then the words at the beginning are less likely to be selected. example:

      a
      an
      as
      at
      ...
      ...
      we
      ...

      if we are looking for a two-letter word, when will 'an' be chosen?
      if we go back to the beginning every time we reach the end, 'an' will be chosen whenever our first random choice falls between the 'we' entry and the end of the file, or between the beginning of the file and the 'an' entry. if we don't do this... 'an' will only be selected when our random choice falls between the beginning of the file and the 'an' entry. a much lower probability than before. if the words are uniformly distributed, 'an' will be chosen much less often than any of the other words.

      this example brings up another point mentioned in a different reply. notice that 'as' comes right after 'an' and 'at' comes right after 'as', but 'we' comes much, much later. this is an example of a non-uniform distribution. you are much more likely to select 'we' (assuming no other two-letter words in the file, other than the ones listed) than 'as' or 'at'. in order to select 'as' or 'at' you must land exactly on the proper line. to select 'we', you just have to land somewhere after 'at' and before 'we', a much larger span of lines.

      there is one more important point about randomness - also mentioned in another post, but i will reiterate it here. random numbers as generated by Perl are only pseudo-random. They are generated using a complicated algorithm, but it is based on a seed number (which can be set with srand()). They are also limited by the size of the MAX_RAND on your computer. In a large file - like a dicitonary file - these pseudo-random numbers can create strikingly non-random results.

      moral of the story:
      be careful when using random numbers.
      in this example, i'd suggest pre-processing the file (see other post - this will guarantee uniform distributions of n-length words), using truly random numbers (also explained in another post) and restarting at the beginning after an EOF.

      jeff

      ps. i hope this straightens everything out. kinda longwinded, but its a very in-depth topic.
        Explanations of Randomness strikes me as a node that required a lot of work. It certainly does not deserve a negative reputation, unless I am really missing something.

        Good place for a vote, methinks.

        Russ
        Brainbench 'Most Valuable Professional' for Perl

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://26069]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others avoiding work at the Monastery: (4)
As of 2024-03-29 05:29 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found