PerlMonks |
Re: improving the efficiency of a script (random sample) by ambrus (Abbot)
on Jun 18, 2006 at 20:46 UTC ( [id://556138] )
I show a one-pass solution to this problem using the combinatorial algorithm. Here, one-pass means that you need only O(gm) memory to print g words when the maximal word length is m, you have to read the file only once, and you don't even need to know the number of words in the dictionary in advance. Apart from this, I don't put emphasis on keeping the computation time low. That could also be done easily (while still keeping the previous efficiency conditions true); for that, see Algorithm R in chapter 3.4.2 of Knuth, but I leave the implementation as an exercise for the reader.

You didn't say whether there's any requirement on the order of the words printed, so I assume it can be anything (whatever is simplest to implement). I'll also assume that if there are fewer than 100 words starting with a certain letter, we have to print all of them. And naturally the usual disclaimer applies to the code: I put this together quickly and it may have errors.

As a simpler example, I first show how to select just 100 words uniformly at random from a dictionary, independently of first letters. Doing this for every letter, we get this:
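The original code blocks did not survive in this copy of the node, so here is a reconstruction of the idea rather than the author's exact program: a one-pass per-letter sampler in the spirit of Knuth's Algorithm R. The function name `sample_by_letter` and the use of STDIN as the input are assumptions for illustration. Selecting 100 words regardless of first letter is just the special case with a single bucket.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One-pass sampler: keep up to $g uniformly chosen words for each
# first letter (the reservoir idea behind Knuth's Algorithm R).
# Memory use is O($g * max word length) per letter; the input is
# read exactly once and its total length need not be known.
sub sample_by_letter {
    my ($fh, $g) = @_;
    my (%sample, %count);
    while (my $word = <$fh>) {
        chomp $word;
        next unless length $word;
        my $c = lc substr $word, 0, 1;
        my $n = ++$count{$c};               # words seen with this letter
        my $s = $sample{$c} ||= [];
        if (@$s < $g) {
            push @$s, $word;                # reservoir not yet full
        } else {
            my $j = int rand $n;            # uniform index in 0 .. $n-1
            $s->[$j] = $word if $j < $g;    # replace with probability $g/$n
        }
    }
    return \%sample;
}

# Read one word per line from standard input, print 100 per letter.
my $sample = sample_by_letter(\*STDIN, 100);
for my $c (sort keys %$sample) {
    print "$_\n" for @{ $sample->{$c} };
}
```

Since words only ever replace a uniformly random slot once the reservoir is full, each word with a given first letter ends up in that letter's sample with equal probability.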
Update. Another one-pass solution would be to use heaps. You create a heap for each letter, add words as you read them to the corresponding heap using a random number as the priority, and pop an element whenever the heap grows larger than 100. I guess this would be less CPU-efficient than the above-mentioned algorithm in Knuth, if that is implemented well.

Update 2008 Oct 9: see also Randomly select N lines from a file, on the fly.

Update 2009-12-26: see also Random sampling a variable record-length file, which, by the time you look there, should have some good solutions as well.
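The heap variant can be sketched as follows; this is my own illustration, not code from the node. To keep it self-contained I inline a minimal binary max-heap over [priority, word] pairs instead of pulling in a CPAN heap module (an assumed design choice). Popping the maximum-priority entry whenever a letter's heap exceeds 100 leaves the 100 words with the smallest random priorities, i.e. a uniform sample.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Minimal binary max-heap on [priority, payload] pairs, stored in an array.
sub heap_push {
    my ($h, $item) = @_;
    push @$h, $item;
    my $i = $#$h;
    while ($i > 0) {                         # sift up
        my $p = int(($i - 1) / 2);
        last if $h->[$p][0] >= $h->[$i][0];
        @$h[$p, $i] = @$h[$i, $p];
        $i = $p;
    }
}

sub heap_pop {                               # remove and return max-priority item
    my ($h) = @_;
    my $top  = $h->[0];
    my $last = pop @$h;
    if (@$h) {
        $h->[0] = $last;
        my $i = 0;
        while (1) {                          # sift down
            my ($l, $r) = (2 * $i + 1, 2 * $i + 2);
            my $m = $i;
            $m = $l if $l < @$h && $h->[$l][0] > $h->[$m][0];
            $m = $r if $r < @$h && $h->[$r][0] > $h->[$m][0];
            last if $m == $i;
            @$h[$m, $i] = @$h[$i, $m];
            $i = $m;
        }
    }
    return $top;
}

# One heap per first letter; random priorities; evict the largest
# whenever a heap holds more than 100 words.
my %heap;
while (my $word = <STDIN>) {
    chomp $word;
    next unless length $word;
    my $c = lc substr $word, 0, 1;
    heap_push($heap{$c} ||= [], [rand, $word]);
    heap_pop($heap{$c}) if @{ $heap{$c} } > 100;
}
for my $c (sort keys %heap) {
    print $_->[1], "\n" for @{ $heap{$c} };
}
```

Each push and pop costs O(log 100) comparisons, which is why the node guesses this is somewhat slower than a well-implemented Algorithm R, though both remain one-pass.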
In Section: Seekers of Perl Wisdom