jettero has asked for the wisdom of the Perl Monks concerning the following question:
I'm wondering about two things, though. First, how do ya know how long the file is so you can select a random starting place? And second, what should I do if I hit the eof before I find a word of the correct length? I was thinking of something like this:

    seek WORDS, $some_random_place, 0;   # seek to some random place in the file
    <WORDS>;                             # to get to the beginning of the next line
    while ($something) {
        # read words until one matches the description
    }
Re: RandomFile
by ferrency (Deacon) on Aug 03, 2000 at 23:38 UTC
For example, if N is constant (where N is the length of the word) across all runs of your code, you might build a new datafile containing only the N-letter words, with some meta-information like "number of lines in the file" tagged on to make processing easier. This is a classic time-space tradeoff: if you're doing a lot of queries and they need to be fast, it's often better to use a bit of extra space to store information about the data.

Another thing you might consider is processing the whole words file and building an array of arrays, where the index in the top-level array corresponds to the number of letters in the word, and the sub-array is a list of seek() positions corresponding to each word of that length. Call the array @wordptrs. Then to choose a random $n letter word, you'd just do:
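The code that originally followed here was lost; a minimal sketch of the idea, assuming @wordptrs has been built as described and WORDS is an open filehandle on /usr/dict/words:

    my $positions = $wordptrs[$n];                    # all seek offsets of $n-letter words
    my $offset    = $positions->[ rand @$positions ]; # pick one offset at random
    seek WORDS, $offset, 0;                           # jump straight to that word
    chomp( my $word = <WORDS> );                      # each word sits on its own line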
If you tie this array of arrays with MLDBM, you can build it once and access the persistent stored copy in the future. You could, of course, just store the words in the array of arrays instead of seek positions. That would be faster, but it would also probably take up way more space.

Okay, just to argue against myself again... you might run into some storage limitations if you try to tie that array of arrays to MLDBM. I remember having a problem at one point where the underlying database would only hold 32k in each key (corresponding to a top-level array entry), so when the sub-array is Very Large, it'll run out of space and complain (or just not work). I'm not sure if this is still a problem or not.

But if you want to do this without any preprocessing at all, a few ideas and caveats: To tell how big your random number should be, just get the file's size in bytes (with stat(), perhaps) and use that as a parameter to rand(). If you reach EOF before finding an appropriate word, then choose a new random seek position between 0 and your previous seek position, and stop looking when you get as far as your previous seek position.

And a caveat: statistically speaking, you're not going to get equal word distribution using the seek() method without preprocessing, unless there's a fairly flat distribution of word lengths in the file. If there are a small number of 6 letter words after M in the dictionary (compared to before M), then you're going to be more likely to hit the 6 letter words after M than the 6 letter words before M.

Hope this helps...

Alan
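A rough sketch of that no-preprocessing recipe (this is not ferrency's code; the filename, the bareword filehandle, and $n are assumptions made for the example):

    # Seek somewhere random, scan forward for an $n-letter word,
    # and on EOF retry between 0 and the previous starting point.
    my $n    = 6;                              # desired word length (assumed)
    my $file = '/usr/dict/words';
    my $size = (stat $file)[7];                # file size in bytes
    open WORDS, $file or die "Can't open $file: $!";

    my $start = int rand $size;
    my $stop;                                  # set once we've had to retry
    seek WORDS, $start, 0;
    <WORDS>;                                   # discard the partial line we landed in

    my $word;
    while ( !defined $word ) {
        if ( eof WORDS ) {
            last if defined $stop;             # already retried once; give up
            $stop  = $start;                   # don't re-scan past our first start
            $start = int rand $start;
            seek WORDS, $start, 0;
            <WORDS>;
            next;
        }
        last if defined $stop && tell(WORDS) >= $stop;   # reached the old start; give up
        chomp( my $candidate = <WORDS> );
        $word = $candidate if length $candidate == $n;
    }
    print defined $word ? "$word\n" : "no $n-letter word found\n";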
Re: RandomFile
by chip (Curate) on Aug 04, 2000 at 00:00 UTC
As for reading a random line from a file: if you're resigned to reading the whole file, this is a neat hack:

-- Chip Salzenberg, Free-Floating Agent of Chaos
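The hack itself didn't survive the page capture; it was most likely the classic one-pass random-line idiom (the perlfaq5 trick), which looks roughly like this, though that attribution is a guess:

    # Read every line once; line $. replaces the current pick with
    # probability 1/$., so every line ends up equally likely.
    open WORDS, '/usr/dict/words' or die "Can't open: $!";
    my $line;
    while (<WORDS>) {
        rand($.) < 1 and $line = $_;
    }
    print $line;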
Re: RandomFile
by Cirollo (Friar) on Aug 03, 2000 at 23:25 UTC
As for hitting the eof, I would say just try again in a new spot.
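A tiny sketch of that "just try again" idea, with WORDS and $n assumed to be set up already; note that it never terminates if no $n-letter word exists at all:

    my $size = -s WORDS;                   # size of the file behind the handle
    my $word;
    until ( defined $word ) {
        seek WORDS, int rand $size, 0;     # pick a fresh random spot
        <WORDS>;                           # move to the start of the next entry
        while (<WORDS>) {
            chomp;
            if ( length($_) == $n ) { $word = $_; last }
        }
        # fell off the end without a match: go around and try another spot
    }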
Re: RandomFile
by eak (Monk) on Aug 03, 2000 at 23:34 UTC
Re: RandomFile
by grackle (Acolyte) on Aug 04, 2000 at 01:22 UTC
This seems like a very tough problem to me. I'm sure that preprocessing is the way to go here, but if there is an efficient way to do it with no preprocessing (either for /usr/dict/words or for an arbitrary file) it would be fascinating to see it.

P.S. I don't know if the original poster wants approximately equal probabilities or not; he/she will probably use preprocessing anyway.

P.P.S. In the algorithms, I'm assuming you go to the beginning of the file when you hit EOF.
by Anonymous Monk on Aug 04, 2000 at 18:08 UTC
by jettero (Monsignor) on Aug 04, 2000 at 06:56 UTC
by dkubb (Deacon) on Jan 17, 2001 at 12:48 UTC
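dkubb's code block was lost when this page was captured; the routine below is a reconstruction based purely on the explanation that follows, so treat the names and details as guesses:

    sub random_line {
        my $fh = shift;                   # e.g. \*WORDS
        seek $fh, int rand(-s $fh), 0;    # jump to a random byte offset
        <$fh>;                            # read off the rest of the line we landed in
        seek $fh, 0, 0 if eof $fh;        # inside the last line? wrap to the top
        chomp( my $line = <$fh> );
        return $line;
    }

    open WORDS, '/usr/dict/words' or die $!;
    print random_line(\*WORDS), "\n";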
Here's an explanation of the code: Most similar routines will break when they randomly select the last line of a file. This is because they do the first <FH>, which forces the file cursor to start at the next line. The only problem is that if you are already positioned somewhere inside the *last* line, the second access of <FH> will return undef, which is not usually what you want.

Another flaw with the standard routine of this type is that we are always looking one line ahead of the line that was randomly seek'd to. This means that no matter how many times you run the routine, it will *always* miss line 1.

The above routine gets around that by simply asking whether we are at the end of the file, using eof. If we are, then we wrap to the beginning. This solves both of the problems above, while still maintaining an acceptable level of randomness with rand.

Update: This routine does have a bias towards longer lines in the file. Do not use it for files where the lines are of variable length. Use the methods described here: How do I pick a random line from a file?.
by Jettra (Initiate) on Jan 17, 2001 at 13:16 UTC
RE: RandomFile
by jlistf (Monk) on Aug 03, 2000 at 23:31 UTC
if this needs to be explained, let me know. jeff
by jettero (Monsignor) on Aug 04, 2000 at 21:11 UTC
by jlistf (Monk) on Aug 04, 2000 at 21:34 UTC
let's say we are searching for a 'random' word with a certain length. first assumption: these n-length words are uniformly distributed throughout the file. if this is the case, then choosing a word at random and moving forward till you find an n-length word will work. but this is a dictionary file... so you only want words at the beginning of entries. so you 'randomly' choose entries to find n-length words.

all right... what happens if you get to the end of the file and you haven't found a proper-length word? you have some options: 1) go back to the beginning of the file and keep looking. 2) work your way backwards in the file. 3) randomly choose another starting point. option 1 is the best option, and here's why: if you don't go back to the very beginning and start looking from there, then the words at the beginning are less likely to be selected. example:

    a
    an
    as
    at
    ...
    ...
    we
    ...

if we are looking for a two-letter word, when will 'an' be chosen? if we go back to the beginning every time we reach the end, 'an' will be chosen whenever our first random choice falls between the 'we' entry and the end of the file, or between the beginning of the file and the 'an' entry. if we don't do this... 'an' will only be selected when our random choice falls between the beginning of the file and the 'an' entry, a much lower probability than before. if the words are uniformly distributed, 'an' will be chosen much less often than any of the other words.

this example brings up another point mentioned in a different reply. notice that 'as' comes right after 'an' and 'at' comes right after 'as', but 'we' comes much, much later. this is an example of a non-uniform distribution. you are much more likely to select 'we' (assuming no other two-letter words in the file, other than the ones listed) than 'as' or 'at'. in order to select 'as' or 'at' you must land exactly on the proper line. to select 'we', you just have to land somewhere after 'at' and before 'we', a much larger span of lines.

there is one more important point about randomness - also mentioned in another post, but i will reiterate it here. random numbers as generated by Perl are only pseudo-random. They are generated using a complicated algorithm, but it is based on a seed number (which can be set with srand()). They are also limited by the size of RAND_MAX on your computer. In a large file - like a dictionary file - these pseudo-random numbers can create strikingly non-random results. moral of the story: be careful when using random numbers.

in this example, i'd suggest pre-processing the file (see other post - this will guarantee uniform distributions of n-length words), using truly random numbers (also explained in another post) and restarting at the beginning after an EOF.

jeff

ps. i hope this straightens everything out. kinda long-winded, but it's a very in-depth topic.
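For concreteness, here is a sketch of "option 1" above (wrap to the beginning on EOF, and give up once we're back where we started); WORDS and $n are assumed to be set up already:

    my $size  = -s WORDS;
    my $start = int rand $size;           # our random landing point
    seek WORDS, $start, 0;
    <WORDS>;                              # skip to the start of the next entry

    my ( $word, $wrapped );
    until ( defined $word ) {
        if ( eof WORDS ) {
            last if $wrapped++;           # second time at EOF: nothing to find
            seek WORDS, 0, 0;             # option 1: wrap to the beginning
        }
        last if $wrapped && tell(WORDS) > $start;   # scanned the whole file
        chomp( my $entry = <WORDS> );
        $word = $entry if length $entry == $n;
    }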
by Russ (Deacon) on Aug 05, 2000 at 02:20 UTC
by johannz (Hermit) on Aug 05, 2000 at 03:09 UTC
Re: RandomFile (Simple preprocessor)
by grackle (Acolyte) on Aug 04, 2000 at 01:50 UTC
You could create a file that includes nothing but the locations of the beginning of each line in /usr/dict/words. It would be big, but it would give you simple, efficient, and very effective solutions to your problem and many similar ones. For example, your problem would be solved by the following:
1. Read a certain number of random locations from the index file (easy with constant-length records).
2. Seek to each of those locations in /usr/dict/words and read the word that starts there.
3. If one of those words has the desired length, use it; otherwise start over at step 1.
The number of locations you read in step 1 would have to be optimized for performance (reading time vs. how often the algorithm must be restarted), and might vary with the length of word you want. For instance, you could hard-code an array @num in your script so that reading $num[$n] locations would locate an n-letter word 95% of the time, and the algorithm would be restarted the remaining 5% of the time. The numbers wouldn't be very big, and you wouldn't have to be right on the optimum solution to get good performance.

The best thing about this solution is that the index file would be useful for almost any program that needs to get random words from a certain subset of /usr/dict/words. If you have any other problems similar to the one you posted, this might be the way to go.
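A sketch of both halves of that scheme; the index filename, the pack format, and the batch size of 10 (standing in for a tuned $num[$n]) are all assumptions:

    # One-time preprocessing: record the byte offset of every line start
    # in /usr/dict/words as a fixed-width 4-byte record.
    open WORDS, '/usr/dict/words' or die "words: $!";
    open INDEX, '>words.idx'      or die "index: $!";
    my $offset = 0;
    while (<WORDS>) {
        print INDEX pack 'N', $offset;
        $offset += length;                # length of the line just read, newline included
    }
    close INDEX;

    # Lookup: read a batch of random index records and keep the first one
    # that points at an $n-letter word; grab another batch if none do.
    my $n       = 6;                      # desired word length (assumed)
    my $records = ( -s 'words.idx' ) / 4;
    open IDX, 'words.idx' or die "index: $!";

    my $word;
    until ( defined $word ) {             # loops forever if no $n-letter word exists
        for ( 1 .. 10 ) {
            seek IDX, 4 * int rand $records, 0;
            read IDX, my $rec, 4;
            seek WORDS, unpack( 'N', $rec ), 0;
            chomp( my $candidate = <WORDS> );
            if ( length $candidate == $n ) { $word = $candidate; last }
        }
    }
    print "$word\n";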
Re: RandomFile
by larsen (Parson) on Aug 04, 2000 at 00:43 UTC
I haven't studied the code provided there, but it seems interesting. I hope this could be useful :)

Larsen