Re: RandomFile

ferrency is probably right that preprocessing is the only practical way to achieve a flat distribution (i.e., for all n-letter words to pop up with the same frequency). I don't have a solution that you'd want to implement, but it's an interesting question: how would you do this without preprocessing?

Solution 1:: Seek to a random place and read forward until you find an n-letter word.
Result:: Each word appears with a frequency proportional to the distance in the file between the beginning of its line and the beginning of the previous n-letter word's line. (It's hard to imagine when this would give desirable results, but the results would be (technically) random.)

Solution 2:: Seek to a random place. If the next line contains an n-letter word, use it; otherwise, seek to another random place.
Result:: Each n-letter word appears with a frequency proportional to the length of its predecessor's line. (For /usr/dict/words, it's better than the first solution, but still not good.)

Solution 3:: 1. Seek to a random place.
2. Eat a random number of lines (1-N) (flat dist.)
3. If the next word has n letters, use it; otherwise go back to step 2.
Result:: I think I'll implement this to see how well it works. After looking at /usr/dict/words, I doubt this will work much better than Solution 2 for N<100, but I'll have to see.

This seems like a very tough problem to me. I'm sure that preprocessing is the way to go here, but if there is an efficient way to do it with no preprocessing (either for /usr/dict/words or for an arbitrary file) it would be fascinating to see it.

P.S. I don't know if the original poster wants approximately equal probabilities or not; he/she will probably use preprocessing anyway. P.P.S. In the algorithms, I'm assuming you go to the beginning of the file when you hit EOF.

Comment on Re: RandomFile

Replies are listed 'Best First'.
There is a simple one by Anonymous Monk on Aug 04, 2000 at 18:08 UTC
For some definition of efficient it is efficient as well. Seek to a random place in the file. Walk backwards until you find the start of that line, read it. If it is the wrong length try again. Average run-time is O(1) and your choice is perfectly random, but the constant depends on the proportion of the file that is in N letter words. If none then you will never find that out. (OK, you can skip whenever it comes up with a position you have tested. This involves more pre-processing but can be faster in the end.) Cheers, Ben	[reply]
RE: Re: RandomFile by jettero (Monsignor) on Aug 04, 2000 at 06:56 UTC
I went with the first option. I was glad you gave me choices ;). The first one was exactly the behaviour I was looking for. You seen to dislike this option but I fail to see why. Here's the code I used, based on all the help in this thread: use strict; my $dict = "/usr/dict/words"; my ($set, $cur, $end) = (0, 1, 2); for my $i (1..30) { printf "word #%2d: \%s\n", $i, &pick_word(10); } sub pick_word { my $reqsize = shift; my $ret = ""; my $die = 0; open DICT, "$dict" or die "redmist code: $!"; seek DICT, rand(-s $dict), $set; <DICT>; # this get's us to the start of the next line. TOP: while (<DICT>) { ($ret = $1 and last) if /^(\w{$reqsize})$/; } if(length($ret) != $reqsize) { # ok, go back around seek DICT, 0, $set; # but remember we were already here ... die "Word too big, aaaaaaaahhhhhh!" if $die; $die = 1; goto TOP; } close DICT; return $ret; } [download]	[reply] [d/l]
Re: Random Line from a file by dkubb (Deacon) on Jan 17, 2001 at 12:48 UTC
Here's a routine that should return a random line from a file, without needing to read the entire contents of the file into memory: my $file = '/usr/dict/words'; #selecting a random line from a file my $line = do { open FH, "<$file" or die "Could not open $file: $!"; #Go to a random spot in the file seek FH, rand((-s FH) + 1), 0; #This forces us to the next line, because #we might be positioned in the middle of #a line in FH. <FH>; #If the next call to <FH> is about to #return a part of the last line in the #file, then wrap around to the beginning #of the file. We ONLY want complete lines. seek(FH, 0, 0) if eof; #Return the next available, complete, line <FH>; }; [download] Here's an explanation of the code: Most similar routines will break when they randomly select the last line of a file. This is because they do the first `<FH>` , which forces the file cursor to start at the next line. The only problem is that if you are already positioned somewhere inside the last line, the second access of <FH> will return undef, which is not usually what you want. Another flaw, with the standard routine of this type, is that we are always looking one line ahead of the line that was randomly seek'd to. This means, that no matter how many times you run the routine, it will always miss line 1. The above routine gets around that by simly asking to see if we are at the end of the file, using eof. If we are, then we wrap to the beginning. This solves both the problems above, while still maintaining an acceptable level of randomness with rand. Update: This routine does have a bias towards longer lines in the file. Do not use it for files where the fields are variable length. Use the methods described here: How do I pick a random line from a file?.	[reply] [d/l] [select]
Re: Re: Random Line from a file by Jettra (Initiate) on Jan 17, 2001 at 13:16 UTC
This will work fine if all the lines in the file are the same (or at least close) in length. However, if you have a file where the text of each line varies, your system will be biased towards longer lines since it is more likely to randomly 'hit' these ones over its smaller brothers. Jettra	[reply]