in reply to RandomFile
For example, if N is constant (where N is the length of the word) across all runs of your code, you might build a new datafile containing only the N-letter words, with some meta-information like "number of lines in the file" tagged on to make processing easier. This is a classic time-space tradeoff: if you're doing a lot of queries and they need to be fast, it's often better to use a bit of extra space to store information about the data.
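Here's a rough sketch of what that preprocessing might look like. The sub names and the file layout (count on the first line, then one word per line) are just my own choices for illustration, and it assumes Unix newlines and single-byte characters so that every record in the new file is the same width:

```perl
use strict;
use warnings;

# Copy every word of exactly $n letters from $in_path to $out_path.
# The first line of the output holds the word count (the "meta-information").
# Returns the number of words written.
sub build_length_file {
    my ($in_path, $out_path, $n) = @_;
    open my $in, '<', $in_path or die "Can't read $in_path: $!";
    my @words;
    while (my $line = <$in>) {
        chomp $line;
        push @words, $line if length($line) == $n;
    }
    close $in;

    open my $out, '>', $out_path or die "Can't write $out_path: $!";
    print {$out} scalar(@words), "\n";
    print {$out} "$_\n" for @words;
    close $out or die "Can't close $out_path: $!";
    return scalar @words;
}

# Pick a random word from a file written by build_length_file().
# Every record is the same width, so the count lets us seek straight to it.
sub random_word_from_length_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't read $path: $!";
    my $count = <$fh>;
    return unless defined $count;
    chomp $count;
    return unless $count > 0;

    my $data_start = tell $fh;        # offset of the first word
    my $first      = <$fh>;
    my $record_len = length $first;   # word plus its newline

    seek($fh, $data_start + int(rand $count) * $record_len, 0)
        or die "seek failed: $!";
    my $word = <$fh>;
    chomp $word;
    return $word;
}
```

You'd run build_length_file() once, and every later query is then a single seek and a single read.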
Another thing you might consider is processing the whole words file and building an array of arrays, where the index in the top-level array corresponds to the number of letters in the word, and each sub-array is a list of seek() positions, one for each word of that length. Call the array @wordptrs. Then to choose a random $n-letter word, you'd just do (with WORDS being the filehandle you have open on the words file):

    my $list     = $wordptrs[$n];            # seek positions of all $n-letter words
    my $word_pos = $list->[ rand @$list ];   # pick one of them at random
    seek(WORDS, $word_pos, 0);               # jump to the start of that word
    my $word     = <WORDS>;                  # still has its trailing newline

If you tie this array of arrays with MLDBM, you can build it once and access the persistent stored copy in the future. You could, of course, just store the words themselves in the array of arrays instead of seek positions. That would be faster, but it would probably also take up far more space.
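For what it's worth, building @wordptrs could look something like the sketch below. The sub names are made up for illustration, it returns a reference rather than filling a global array, and it assumes one word per line with byte offsets taken from tell():

```perl
use strict;
use warnings;

# Scan the words file once and return a reference to an array of arrays:
# $wordptrs->[$len] lists the byte offsets of all words with $len letters.
sub build_wordptrs {
    my ($fh) = @_;
    my @wordptrs;
    seek($fh, 0, 0) or die "seek failed: $!";
    while (1) {
        my $pos  = tell $fh;      # offset of the line about to be read
        my $line = <$fh>;
        last unless defined $line;
        chomp $line;
        push @{ $wordptrs[ length $line ] }, $pos;
    }
    return \@wordptrs;
}

# Pick a random $n-letter word using the offsets from build_wordptrs().
# Returns nothing if the file has no words of that length.
sub random_word {
    my ($fh, $wordptrs, $n) = @_;
    my $list = $wordptrs->[$n] or return;
    my $pos  = $list->[ rand @$list ];
    seek($fh, $pos, 0) or die "seek failed: $!";
    my $word = <$fh>;
    chomp $word;
    return $word;
}
```

The scan costs one full pass over the file, but after that each lookup is one rand(), one seek() and one line read.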
Okay, just to argue against myself again... you might run into some storage limitations if you try to tie that array of arrays to MLDBM. I remember having a problem at one point where the underlying database would only hold 32k in each key (corresponding to a top-level array entry), so when the sub-array is Very Large, it'll run out of space and complain (or just not work). I'm not sure if this is still a problem or not.
But if you want to do this without any preprocessing at all... a few ideas and caveats:
The idea is to seek() to a random position in the file and read forward until you find a word of the right length. To choose your random number, get the file's size in bytes (with stat() or the -s file test) and use that as the argument to rand(). If you reach EOF before finding an appropriate word, choose a new random seek position between 0 and your previous seek position, and this time stop looking once you get as far as the previous seek position, since you've already checked everything after it.
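A minimal sketch of that retry loop, in case it helps. The sub name is hypothetical, and throwing away the partial line after each seek is my own assumption about how you'd handle landing in the middle of a word:

```perl
use strict;
use warnings;

# Return a random $n-letter word from $path without any preprocessing,
# or nothing if the file contains no word of that length.
sub random_word_by_seek {
    my ($path, $n) = @_;
    my $limit = -s $path;             # file size in bytes
    return unless $limit;
    open my $fh, '<', $path or die "Can't read $path: $!";

    while (1) {
        my $pos = int rand $limit;    # 0 .. $limit - 1
        seek($fh, $pos, 0) or die "seek failed: $!";

        # Unless we are at the very start of the file, we have probably
        # landed mid-word, so discard the rest of the current line.
        if ($pos > 0) { my $partial = <$fh>; }

        # Check only lines that start at or before $limit; everything
        # after that was already checked on an earlier pass.
        while (tell($fh) <= $limit) {
            my $line = <$fh>;
            last unless defined $line;    # hit EOF
            chomp $line;
            return $line if length($line) == $n;
        }

        return if $pos == 0;          # whole file searched, nothing found
        $limit = $pos;                # retry somewhere before this position
    }
}
```

Each retry picks a strictly smaller position, so the loop always terminates, and between them the passes cover every line exactly once.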
And a caveat: statistically speaking, the seek() method without preprocessing won't pick every word with equal probability unless the words of the length you want are spread fairly evenly through the file. Each word's chance of being picked is roughly proportional to the size of the gap between it and the previous matching word. So if there are only a small number of 6-letter words after M in the dictionary (compared to before M), then any given 6-letter word after M is more likely to be hit than any given 6-letter word before M.
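If you want to see the skew for yourself, here's a small sketch that tries every possible starting offset on an in-memory "file" and counts how many of them lead to each word. The sub name is made up, and it only models the first forward scan (no retry after EOF), with the same discard-the-partial-line assumption:

```perl
use strict;
use warnings;

# For every possible starting offset in $text, find which $n-letter word a
# single forward scan would return, and count the offsets leading to each.
sub offsets_per_word {
    my ($text, $n) = @_;
    open my $fh, '<', \$text or die "Can't open in-memory file: $!";
    my %hits;
    for my $pos (0 .. length($text) - 1) {
        seek($fh, $pos, 0) or die "seek failed: $!";
        if ($pos > 0) { my $partial = <$fh>; }   # drop the partial line
        while (defined(my $line = <$fh>)) {
            chomp $line;
            if (length($line) == $n) {
                $hits{$line}++;
                last;
            }
        }
    }
    close $fh;
    return \%hits;
}

# Two 2-letter words bunched at the top, one alone at the bottom.
my $hits = offsets_per_word("ab\ncd\nefghij\nklmnop\nqrstuv\nwx\n", 2);
print "$_: $hits->{$_}\n" for sort keys %$hits;
```

In that toy list, "wx" sits after a long run of non-matching words, so far more starting offsets lead to it than to "ab" or "cd".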
Hope this helps...
Alan