If you're going to run a lot of queries like this, it may be worth preprocessing the dataset once, so you don't have to scan it every time you look for a word.

For example, if N is constant (where N is the length of the word) across all runs of your code, you might build a new datafile containing only the N-letter words, with some meta-information like "number of lines in the file" tagged on to make processing easier. This is a classic time-space tradeoff: if you're doing a lot of queries and they need to be fast, it's often better to use a bit of extra space to store information about the data.
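A rough sketch of that first idea, in case it helps. The sub names and the file layout (count on the first line, then one word per line) are my own assumptions, not anything established in this thread:

```perl
use strict;
use warnings;

# Copy only the $n-letter words from $src into $dst, with the word
# count written as the first line so later reads need not count lines.
sub write_length_file {
    my ($src, $dst, $n) = @_;
    open my $in, '<', $src or die "Cannot read $src: $!";
    my @words;
    while (my $w = <$in>) {
        chomp $w;
        push @words, $w if length($w) == $n;
    }
    close $in;

    open my $out, '>', $dst or die "Cannot write $dst: $!";
    print {$out} scalar(@words), "\n";
    print {$out} "$_\n" for @words;
    close $out or die "Cannot close $dst: $!";
    return scalar @words;
}

# Pick a uniformly random word from a file written by write_length_file.
# Returns nothing if the file holds no words.
sub random_word_from_length_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot read $path: $!";
    my $count = <$fh>;
    return unless defined $count;
    chomp $count;
    return unless $count;

    my $target = int rand $count;      # 0 .. $count - 1
    my $word;
    $word = <$fh> for 0 .. $target;    # read up to the chosen line
    chomp $word;
    return $word;
}
```

You'd call write_length_file() once per word length you care about, then random_word_from_length_file() for each query. Every word in the filtered file is equally likely, at the cost of reading up to the chosen line each time.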

Another thing you might consider is processing the whole words file, and building an array of arrays, where the index in the top-level array corresponds to the number of letters in the word, and the sub-array is a list of seek() positions corresponding to each word of that length. Call the array @wordptrs. Then to choose a random $n letter word, you'd just do:

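my $positions = $wordptrs[$n];                     # seek offsets of every $n-letter word
my $word_pos  = $positions->[ rand @$positions ];  # pick one offset at random
seek($fh, $word_pos, 0);                           # $fh: the words file, already open for reading
chomp(my $word = <$fh>);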
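For completeness, here is one way the index itself might be built, using tell() to record where each line starts. This is only a sketch: build_wordptrs and random_word_of_length are names I made up, and it assumes one word per line and returns a reference rather than the plain @wordptrs array.

```perl
use strict;
use warnings;

# Scan the whole file once and return a reference to an array of arrays:
# $wordptrs->[$len] lists the byte offset of every word of length $len.
sub build_wordptrs {
    my ($fh) = @_;
    my @wordptrs;
    seek $fh, 0, 0 or die "seek failed: $!";
    while (1) {
        my $pos  = tell $fh;           # offset where the next line starts
        my $line = <$fh>;
        last unless defined $line;
        chomp $line;
        push @{ $wordptrs[ length $line ] }, $pos;
    }
    return \@wordptrs;
}

# Pick a uniformly random word of length $n, or nothing if there is none.
sub random_word_of_length {
    my ($fh, $wordptrs, $n) = @_;
    my $positions = $wordptrs->[$n];
    return unless $positions && @$positions;
    my $pos = $positions->[ rand @$positions ];
    seek $fh, $pos, 0 or die "seek failed: $!";
    chomp(my $word = <$fh>);
    return $word;
}
```

Since every offset in a sub-array is equally likely to be chosen, every word of a given length is too.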

If you tie this structure with MLDBM, you can build it once and access the persistent stored copy in the future. (MLDBM ties a hash rather than an array, so you'd key it by word length.) You could, of course, just store the words themselves instead of seek positions. That would be faster, since it skips the seek and read, but it would probably also take up a lot more space.

Okay, just to argue against myself again... you might run into some storage limitations if you try to tie that array of arrays to MLDBM. I remember having a problem at one point where the underlying database would only hold about 32k per entry (each entry here being one top-level array element), so when a sub-array is very large, it'll run out of space and complain (or just not work). I'm not sure if this is still a problem or not.
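If that limit does bite, one alternative (not MLDBM, and not something I've measured against it) is to dump the whole index to a single file with the core Storable module. A minimal sketch, where load_or_build_index is a hypothetical helper and $builder is any code ref that returns the index:

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);

# Return the index stored at $index_path if it exists; otherwise call
# $builder to create it, save it for next time, and return it.
sub load_or_build_index {
    my ($index_path, $builder) = @_;
    return retrieve($index_path) if -e $index_path;
    my $index = $builder->();
    nstore($index, $index_path);
    return $index;
}
```

The tradeoff is that the whole index gets loaded into memory on every run, and nothing here notices when the words file changes, so you'd have to delete the stored index yourself to force a rebuild.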

But if you want to do this without any preprocessing at all... a few ideas and caveats:

Seek to a random byte offset in the file: get the file's size in bytes (with stat() or -s, perhaps) and use that as the parameter to rand(). You'll almost always land in the middle of a line, so throw away the partial line, then read forward until you find a word of the right length. If you reach EOF before finding an appropriate word, then choose a new random seek position between 0 and your previous seek position, and stop looking when you get as far as your previous seek position.
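Here's roughly what that could look like. It's a sketch with one simplification that is mine: on hitting EOF it wraps around to the start of the file instead of picking a second random offset, so it is guaranteed to find a word if one exists. The name random_word_by_seek is made up, and it assumes one word per line.

```perl
use strict;
use warnings;

# Seek to a random byte, skip the partial line, and return the first
# $n-letter word found after it, wrapping to the start of the file at EOF.
# Returns nothing if the file has no $n-letter word.
sub random_word_by_seek {
    my ($path, $n) = @_;
    open my $fh, '<', $path or die "Cannot read $path: $!";
    my $size = -s $fh;
    return unless $size;

    my $start = int rand $size;        # random byte offset, 0 .. $size - 1
    seek $fh, $start, 0 or die "seek failed: $!";
    if ($start > 0) { my $partial = <$fh>; }   # discard the partial line

    # First pass: from the random offset to EOF.
    while (defined(my $line = <$fh>)) {
        chomp $line;
        return $line if length($line) == $n;
    }

    # Second pass: from the top, covering every line that begins at or
    # before the original offset (including the one skipped above).
    seek $fh, 0, 0 or die "seek failed: $!";
    while (tell($fh) <= $start) {
        my $line = <$fh>;
        last unless defined $line;
        chomp $line;
        return $line if length($line) == $n;
    }
    return;
}
```

Each call costs one seek plus however many lines it has to read before a match, so it's cheap when $n-letter words are common and slow when they're rare.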

And a caveat: statistically speaking, you're not going to get an equal chance for each word using the seek() method without preprocessing, unless the words of the length you want are spread fairly evenly through the file. A word's chance of being picked is roughly proportional to the number of bytes between it and the previous matching word, since any seek that lands in that stretch ends up on it. So if there are a small number of 6 letter words after M in the dictionary (compared to before M), then each 6 letter word after M is more likely to come up than each 6 letter word before M.
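To see the skew on a small file, you can try every possible start offset and count which word each one lands on. This sketch assumes the seek-and-scan pick wraps to the top of the file at EOF (my simplification), and both sub names are invented for illustration:

```perl
use strict;
use warnings;

# The word that a seek-and-scan pick lands on for a given start offset:
# skip the partial line, take the first $n-letter word, wrap at EOF.
sub pick_from_offset {
    my ($fh, $start, $n) = @_;
    seek $fh, $start, 0 or die "seek failed: $!";
    if ($start > 0) { my $partial = <$fh>; }   # discard the partial line
    while (defined(my $line = <$fh>)) {
        chomp $line;
        return $line if length($line) == $n;
    }
    seek $fh, 0, 0 or die "seek failed: $!";
    while (tell($fh) <= $start) {
        my $line = <$fh>;
        last unless defined $line;
        chomp $line;
        return $line if length($line) == $n;
    }
    return;
}

# Try every byte offset in the file and return a hash ref mapping each
# $n-letter word to the fraction of offsets that select it.
sub seek_pick_probabilities {
    my ($path, $n) = @_;
    open my $fh, '<', $path or die "Cannot read $path: $!";
    my $size = -s $fh;
    my (%hits, %prob);
    for my $start (0 .. $size - 1) {
        my $word = pick_from_offset($fh, $start, $n);
        $hits{$word}++ if defined $word;
    }
    close $fh;
    $prob{$_} = $hits{$_} / $size for keys %hits;
    return \%prob;
}
```

On a file containing the lines aa, bb, xxxxxxxxxx, cc, asking for 2-letter words, this gives cc a far larger share than bb, because cc has the long line in front of it and bb only has the tail of aa.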

Hope this helps...

Alan

In reply to Re: RandomFile by ferrency
in thread RandomFile by jettero
