jettero has asked for the wisdom of the Perl Monks concerning the following question:

There have been other questions like this, the closest is how do I pick a random line from a file? I want to select a random n-letter word from /usr/dict/words without having to start from the top and read toward the bottom.
# was thinking of something like this
seek( some random place in the file or something );
<>;   # to get to the beginning of the next line
while(something) { read words until one matches the description }
I'm wondering about two things though. First, how do ya know how long the file is so you can select a random starting place? And second, what should I do if I hit EOF before I find a word of the correct length?

Replies are listed 'Best First'.
Re: RandomFile
by ferrency (Deacon) on Aug 03, 2000 at 23:38 UTC
    If you're going to be doing A Lot of queries for this sort of thing, you might want to do some pre-processing on the dataset once, so you don't need to do it every time you look for a word.

    For example, if N is constant (where N is the length of the word) across all runs of your code, you might build a new datafile containing only the N-letter words, with some meta-information like "number of lines in the file" tagged on to make processing easier. This is a classic time-space tradeoff: if you're doing a lot of queries and they need to be fast, it's often better to use a bit of extra space to store information about the data.
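    A minimal sketch of that tradeoff (my code, not part of the thread; the "words.N" file name and the count-on-the-first-line header are assumptions):

        use strict;

        # Preprocess: keep only the $n-letter words, with the line count
        # on the first line as meta-information.
        my $n = 6;
        open my $in, '<', '/usr/dict/words' or die "open: $!";
        my @words = grep { length($_) == $n } map { chomp; $_ } <$in>;
        close $in;

        open my $out, '>', "words.$n" or die "open: $!";
        print $out scalar(@words), "\n";
        print $out "$_\n" for @words;
        close $out;

        # A later query only has to touch the small per-length file:
        open my $fh, '<', "words.$n" or die "open: $!";
        chomp(my $count = <$fh>);
        <$fh> for 1 .. int rand $count;   # skip to a random line
        chomp(my $word = <$fh>);
        print "$word\n";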

    Another thing you might consider is processing the whole words file and building an array of arrays, where the index in the top-level array corresponds to the number of letters in the word, and the sub-array is a list of seek() positions corresponding to each word of that length. Call the array @wordptrs. Then to choose a random $n-letter word, you'd just do:

    my $word_pos = $wordptrs[$n][ rand @{ $wordptrs[$n] } ];
    seek WORDS, $word_pos, 0;   # assumes a WORDS handle on the dictionary
    my $word = <WORDS>;
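    For completeness, the @wordptrs index used above might be built along these lines (my rendering, not ferrency's code; it leaves the WORDS handle open for the lookup):

        # Record the seek() position of every word, bucketed by length.
        # tell() before each read gives the byte offset of the coming line.
        my @wordptrs;
        open WORDS, '<', '/usr/dict/words' or die "open: $!";
        my $pos = tell WORDS;
        while (my $word = <WORDS>) {
            chomp $word;
            push @{ $wordptrs[ length $word ] }, $pos;
            $pos = tell WORDS;
        }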
    If you tie this array of arrays with MLDBM, you can build it Once, and access the persistent stored copy in the future. You could, of course, just store the words in the array of arrays, instead of seek positions. That would be faster. But it would also probably take up way more space.

    Okay, just to argue against myself again... you might run into some storage limitations if you try to tie that array of arrays to MLDBM. I remember having a problem at one point where the underlying database would only hold 32k in each key (corresponding to a top-level array entry), so when the sub-array is Very Large, it'll run out of space and complain (or just not work). I'm not sure if this is still a problem or not.

    But if you want to do this without any preprocessing at all... a few ideas and caveats:

    To tell how to choose your random number, just get the file's size in bytes (with stat(), perhaps) and use that as a parameter to rand(). If you reach EOF before finding an appropriate word, then choose a new random seek position between 0 and your previous seek position, and stop looking when you get as far as your previous seek position.
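    That retry scheme might look like this in code (a sketch of my own, not ferrency's; note that, like any skip-to-the-next-line approach, it can never return the file's first line, and it loops forever if no word of the right length exists):

        use strict;

        my $n    = 6;                       # wanted word length
        my $file = '/usr/dict/words';
        my $size = -s $file;                # file size in bytes, as with stat()
        open my $dict, '<', $file or die "open: $!";

        my $prev = $size;                   # previous seek position (EOF to start)
        my $word;
        until (defined $word) {
            my $start = int rand $prev;     # new start between 0 and the last one
            seek $dict, $start, 0;
            <$dict>;                        # skip the partial line we landed in
            while (<$dict>) {
                chomp;
                if (length($_) == $n) { $word = $_; last; }
                last if tell($dict) >= $prev;   # stop at the previous start point
            }
            $prev = $start;                 # shrink the window and retry
        }
        print "$word\n";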

    And a caveat: statistically speaking, you're not going to get equal word distribution using the seek() method without preprocessing, unless there's a fairly flat distribution of word lengths in the file. If there are a small number of 6 letter words after M in the dictionary (compared to before M), then you're going to be more likely to hit the 6 letter words after M than the 6 letter words before M.

    Hope this helps...

    Alan

Re: RandomFile
by chip (Curate) on Aug 04, 2000 at 00:00 UTC
    Unfortunately, any simple way of picking a random line from a text file will probably over-select long lines, or perhaps the lines after long lines. TANSTAAFL.

    As for reading a random line from a file: If you're resigned to reading the whole file, this is a neat hack:

    # line $. replaces the current choice with probability 1/$.,
    # which leaves every line seen so far equally likely
    while (<FILE>) {
        $chosen_line = $_ if rand($.) < 1;
    }
        -- Chip Salzenberg, Free-Floating Agent of Chaos
Re: RandomFile
by Cirollo (Friar) on Aug 03, 2000 at 23:25 UTC
    To get the file size, you can use the '-s' file test operator, which returns the file's size in bytes (or false if the file is empty).
    For example,
    $foo = -s "file.txt";
    print $foo;
    As for hitting the eof, I would say just try again in a new spot.
Re: RandomFile
by eak (Monk) on Aug 03, 2000 at 23:34 UTC
    The only way to do what you want is to read the entire file. To quote a line from Advanced Perl Programming: "In general, since line lengths vary, it's not possible to access a particular line number without examining the whole file up to that line number, unless all your lines are known to be of a particular length, or you've built an index that translates line numbers into byte offsets." --eric
Re: RandomFile
by grackle (Acolyte) on Aug 04, 2000 at 01:22 UTC
    ferrency is probably right that preprocessing is the only practical way to achieve a flat distribution (i.e., for all n-letter words to pop up with the same frequency). I don't have a solution that you'd want to implement, but it's an interesting question: how would you do this without preprocessing?

    Solution 1:
    Seek to a random place and read forward until you find an n-letter word.
    Result:
    Each word appears with a frequency proportional to the distance in the file between the beginning of its line and the beginning of the previous n-letter word's line. (It's hard to imagine when this would give desirable results, but the results would be (technically) random.)

    Solution 2:
    Seek to a random place. If the next line contains an n-letter word, use it; otherwise, seek to another random place.
    Result:
    Each n-letter word appears with a frequency proportional to the length of its predecessor's line. (For /usr/dict/words, it's better than the first solution, but still not good.)

    Solution 3:
    1. Seek to a random place.
    2. Eat a random number of lines (1-N) (flat dist.)
    3. If the next word has n letters, use it; otherwise go back to step 2.
    Result:
    I think I'll implement this to see how well it works. After looking at /usr/dict/words, I doubt this will work much better than Solution 2 for N<100, but I'll have to see.
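    Here is one possible rendering of Solution 3 (my sketch, not grackle's implementation; $N, the most lines eaten per step, is a tuning knob, and EOF wraps to the beginning per the P.P.S. below):

        use strict;

        my ($n, $N) = (6, 10);          # wanted word length; max lines to "eat"
        my $file = '/usr/dict/words';
        open my $dict, '<', $file or die "open: $!";
        seek $dict, int rand(-s $file), 0;   # step 1: seek to a random place
        <$dict>;                             # align to the start of a full line

        my $word;
        until (defined $word) {
            # step 2: eat 1..$N lines (flat dist.), wrapping at EOF
            for (1 .. 1 + int rand $N) {
                seek($dict, 0, 0) if eof $dict;
                <$dict>;
            }
            # step 3: if the next word has n letters, use it
            seek($dict, 0, 0) if eof $dict;
            chomp(my $candidate = <$dict>);
            $word = $candidate if length($candidate) == $n;
        }
        print "$word\n";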

    This seems like a very tough problem to me. I'm sure that preprocessing is the way to go here, but if there is an efficient way to do it with no preprocessing (either for /usr/dict/words or for an arbitrary file) it would be fascinating to see it.

    P.S. I don't know if the original poster wants approximately equal probabilities or not; he/she will probably use preprocessing anyway. P.P.S. In the algorithms, I'm assuming you go to the beginning of the file when you hit EOF.

      For some definition of efficient, it is efficient as well. Seek to a random place in the file. Walk backwards until you find the start of that line, then read it. If it is the wrong length, try again. Average run-time is O(1) and your choice is perfectly random, since every N-letter line is the same length and therefore equally likely to be landed on; but the constant depends on the proportion of the file that is in N-letter words. If there are none, you will never find that out. (OK, you can skip whenever it comes up with a position you have tested. This involves more pre-processing but can be faster in the end.) Cheers, Ben
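      A sketch of that walk-backwards idea (my code, not Ben's): seek() lands in a line with probability proportional to its byte length, and all N-letter lines are the same length, so the accepted words come out uniformly.

          use strict;

          my $n    = 6;
          my $file = '/usr/dict/words';
          my $size = -s $file;
          open my $dict, '<', $file or die "open: $!";

          my $word;
          until (defined $word) {
              my $pos = int rand $size;
              # walk backwards until the byte before us is a newline
              # (or we reach the very start of the file)
              while ($pos > 0) {
                  seek $dict, $pos - 1, 0;
                  read $dict, my $ch, 1;
                  last if $ch eq "\n";
                  $pos--;
              }
              seek $dict, $pos, 0;
              chomp(my $line = <$dict>);
              $word = $line if length($line) == $n;
          }
          print "$word\n";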
      I went with the first option. I was glad you gave me choices ;). The first one was exactly the behaviour I was looking for. You seem to dislike this option, but I fail to see why. Here's the code I used, based on all the help in this thread:
      use strict;

      my $dict = "/usr/dict/words";
      my ($set, $cur, $end) = (0, 1, 2);   # whence values for seek()

      for my $i (1..30) {
          printf "word #%2d: %s\n", $i, &pick_word(10);
      }

      sub pick_word {
          my $reqsize = shift;
          my $ret = "";
          my $die = 0;

          open DICT, $dict or die "redmist code: $!";
          seek DICT, rand(-s $dict), $set;
          <DICT>;   # this gets us to the start of the next line

          TOP: while (<DICT>) {
              ($ret = $1 and last) if /^(\w{$reqsize})$/;
          }
          if (length($ret) != $reqsize) {   # ok, go back around
              seek DICT, 0, $set;           # but remember we were already here ...
              die "Word too big, aaaaaaaahhhhhh!" if $die;
              $die = 1;
              goto TOP;
          }
          close DICT;
          return $ret;
      }
        Here's a routine that should return a random line from a file, without needing to read the entire contents of the file into memory:
        my $file = '/usr/dict/words';

        # selecting a random line from a file
        my $line = do {
            open FH, "<$file" or die "Could not open $file: $!";

            # Go to a random spot in the file
            seek FH, rand((-s FH) + 1), 0;

            # This forces us to the next line, because we might be
            # positioned in the middle of a line in FH.
            <FH>;

            # If the next call to <FH> is about to return a part of
            # the last line in the file, then wrap around to the
            # beginning of the file. We ONLY want complete lines.
            seek(FH, 0, 0) if eof;

            # Return the next available, complete, line
            <FH>;
        };

        Here's an explanation of the code:

        Most similar routines will break when they randomly select the last line of a file. This is because they do the first <FH>, which forces the file cursor to start at the next line. The only problem is that if you are already positioned somewhere inside the *last* line, the second access of <FH> will return undef, which is not usually what you want.

        Another flaw with the standard routine of this type is that we are always looking one line ahead of the line that was randomly seek'd to. This means that, no matter how many times you run the routine, it will *always* miss line 1.

        The above routine gets around that by simply asking whether we are at the end of the file, using eof. If we are, then we wrap to the beginning. This solves both of the problems above, while still maintaining an acceptable level of randomness with rand.

        Update: This routine does have a bias towards longer lines in the file. Do not use it for files where the lines are of variable length. Use the methods described here: How do I pick a random line from a file?.

RE: RandomFile
by jlistf (Monk) on Aug 03, 2000 at 23:31 UTC
    the first part of this was already solved in another reply. as for the second part... it depends on how you want to implement it. you COULD start at another place at random. you COULD start working backwards. but i think for a truly random selection, you should probably start over at the very beginning of the file.

    if this needs to be explained, let me know.

    jeff
      yeah ... definitely needs to be explained. ;)
        here we go:

        let's say we are searching for a 'random' word with a certain length. first assumption: these n-length words are uniformly distributed throughout the file. if this is the case, then choosing a word at random and moving forward till you find an n-length word will work. but this is a dictionary file... so you only want words at the beginning of entries. so you 'randomly' choose entries to find n-length words. all right... what happens if you get to the end of the file and you haven't found a proper-length word? you have some options: 1) go to the beginning of the file and start looking. 2) work your way backwards in the file. 3) randomly choose another starting point. option 1 is the best option, and here's why: if you don't go back to the very beginning and start looking from there, then the words at the beginning are less likely to be selected. example:

        a
        an
        as
        at
        ...
        ...
        we
        ...

        if we are looking for a two-letter word, when will 'an' be chosen?
        if we go back to the beginning every time we reach the end, 'an' will be chosen whenever our first random choice falls between the 'we' entry and the end of the file, or between the beginning of the file and the 'an' entry. if we don't do this... 'an' will only be selected when our random choice falls between the beginning of the file and the 'an' entry. a much lower probability than before. if the words are uniformly distributed, 'an' will be chosen much less often than any of the other words.

        this example brings up another point mentioned in a different reply. notice that 'as' comes right after 'an' and 'at' comes right after 'as', but 'we' comes much, much later. this is an example of a non-uniform distribution. you are much more likely to select 'we' (assuming no other two-letter words in the file, other than the ones listed) than 'as' or 'at'. in order to select 'as' or 'at' you must land exactly on the proper line. to select 'we', you just have to land somewhere after 'at' and before 'we', a much larger span of lines.

        there is one more important point about randomness - also mentioned in another post, but i will reiterate it here. random numbers as generated by Perl are only pseudo-random. They are generated using a complicated algorithm, but it is based on a seed number (which can be set with srand()). They are also limited by the size of RAND_MAX on your computer. In a large file - like a dictionary file - these pseudo-random numbers can create strikingly non-random results.

        moral of the story:
        be careful when using random numbers.
        in this example, i'd suggest pre-processing the file (see other post - this will guarantee uniform distributions of n-length words), using truly random numbers (also explained in another post) and restarting at the beginning after an EOF.

        jeff

        ps. i hope this straightens everything out. kinda longwinded, but it's a very in-depth topic.
Re: RandomFile (Simple preprocessor)
by grackle (Acolyte) on Aug 04, 2000 at 01:50 UTC
    While I was thinking about methods that don't involve preprocessing, a nice solution to your problem hit me smack in the face. I didn't even see it coming.

    You could create a file that includes nothing but the locations of the beginning of each line in /usr/dict/words. It would be big, but it would give you simple, efficient, and very effective solutions to your problem and many similar ones.

    For example, your problem would be solved by the following:

    1. Read a certain number of random locations from the index file (easy with constant-length records).
    2. Check each location (in /usr/dict/words) until you find an n-letter word. If you don't find one, go back to step 1.

    The number of locations you read in step 1 would have to be tuned for performance (reading time vs. how often the algorithm must be restarted), and might vary with the length of word you want. For instance, you could hard-code an array @num in your script so that reading $num[$n] locations would locate an n-letter word 95% of the time, and the algorithm would be restarted the remaining 5% of the time. The numbers wouldn't be very big, and you wouldn't have to be right on the optimum solution to get good performance. A sketch of the whole scheme follows.
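    Here is one possible rendering (my code, not grackle's; the "words.idx" name and the fixed 8-digit record width are assumptions):

        use strict;

        # Build the index: one constant-length record per line, holding
        # the line's byte offset in /usr/dict/words (8 digits is plenty).
        open my $dict, '<', '/usr/dict/words' or die "open: $!";
        open my $idx,  '>', 'words.idx'       or die "open: $!";
        my $pos = 0;
        while (<$dict>) {
            printf $idx "%08d", $pos;
            $pos = tell $dict;
        }
        close $idx;

        # Step 1: read a batch of random locations from the index.
        # Step 2: check each location for an n-letter word.
        my $n       = 6;
        my $batch   = 20;                    # stand-in for $num[$n] above
        my $records = (-s 'words.idx') / 8;
        open $idx, '<', 'words.idx' or die "open: $!";
        for (1 .. $batch) {
            seek $idx, 8 * int rand($records), 0;
            read $idx, my $offset, 8;
            seek $dict, $offset, 0;
            chomp(my $word = <$dict>);
            if (length($word) == $n) { print "$word\n"; last; }
        }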

    The best thing about this solution is that the index file would be useful for almost any program that needs to get random words from a certain subset of /usr/dict/words. If you have any other problems similar to the one you posted, this might be the way to go.

Re: RandomFile
by larsen (Parson) on Aug 04, 2000 at 00:43 UTC
    Consider reading how pseudorandom numbers work by Tom Phoenix.

    I haven't studied the code provided there, but it seems interesting.

    I hope this could be useful :)

    Larsen