
I show a one-pass solution to this problem using a combinatorial algorithm. Here one-pass means that you need only O(gm) memory if you want to print g words and the maximal word length is m, you read the file only once, and you don't even need to know the number of words in the dictionary in advance. Apart from this, I put no emphasis on keeping the computation time low. That could also be done easily (while still keeping the previous efficiency conditions true). For that, see Algorithm R in section 3.4.2 of Knuth; I leave the implementation as an exercise to the reader.

You didn't say if there's any requirement on the order of the words printed, so I assume it can be anything (whatever is simplest to implement). I'll also assume that if there are fewer than 100 words starting with a certain letter, we have to print all of them. And naturally, the usual disclaimer applies to the code: I put this together quickly and it may have errors.

As a simpler example, I first show how to select just 100 words uniformly at random from a dictionary, regardless of their first letters.

use warnings;
use strict;

my $g = 100;
my @c;
my $n = 0;
while (<>) {
    # keep the $n-th line with probability $g / $n
    if (rand() < $g / ++$n) {
        # insert while filling up, else replace a random entry
        splice @c, int(rand(@c)), $g <= @c, $_;
    }
}
print for @c;
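
As a rough sanity check on the selection rule above, here is a sketch that wraps the same logic in a function and counts how often each item is kept. The function name reservoir_sample, the list-based interface and the trial count are my own choices for illustration, not part of the script above. If the rule is uniform, every item should be kept in about g/n of the runs (0.3 here).

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for this sketch): the same
# selection rule, sampling from a list instead of from a file.
sub reservoir_sample {
    my ($g, @items) = @_;
    my (@c, $n);
    for my $item (@items) {
        # Keep the $n-th item with probability $g / $n.
        if (rand() < $g / ++$n) {
            # Insert while the sample is still filling up,
            # otherwise overwrite one randomly chosen slot.
            splice @c, int(rand(@c)), $g <= @c, $item;
        }
    }
    return @c;
}

# Sample 3 of 10 items many times and print how often each was kept.
my $trials = 20_000;
my %count;
for (1 .. $trials) {
    $count{$_}++ for reservoir_sample(3, 1 .. 10);
}
printf "%2d: %.3f\n", $_, $count{$_} / $trials for 1 .. 10;
```

The printed frequencies vary from run to run; this only gives empirical evidence, not a proof.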

Doing this separately for every first letter, we get this:

use warnings;
use strict;

my $g = 100;
my (%c, %n);    # per-letter samples and line counts
while (<>) {
    my $l = /(.)/ && lc($1);    # first character, lowercased
    my $c = \@{ $c{$l} };
    if (rand() < $g / ++$n{$l}) {
        splice @$c, int(rand(@$c)), $g <= @$c, $_;
    }
}
print @$_ for values(%c);

Update. Another one-pass solution would be to use heaps. You create a heap for each letter, add each word to the corresponding heap as you read it, using a random number as its priority, and pop an element whenever the heap grows larger than 100. I guess that this would be less CPU-efficient than a good implementation of the Knuth algorithm mentioned above.
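
A minimal sketch of that heap idea follows, assuming a hand-rolled binary max-heap in plain Perl. The names heap_push, heap_pop_max and sample_by_letter are hypothetical, and I take the words from a list rather than from a file to keep it short. Each word gets a random priority, and the word with the largest priority is dropped whenever a heap exceeds g entries, so each heap ends up holding the g words with the smallest priorities.

```perl
use strict;
use warnings;

# Binary max-heap of [priority, word] pairs kept in an array ref.
sub heap_push {
    my ($h, $item) = @_;
    push @$h, $item;
    my $i = $#$h;
    while ($i > 0) {                      # sift the new item up
        my $p = int(($i - 1) / 2);
        last if $h->[$p][0] >= $h->[$i][0];
        @$h[$p, $i] = @$h[$i, $p];
        $i = $p;
    }
}

sub heap_pop_max {
    my ($h) = @_;
    return unless @$h;
    my $top  = $h->[0];
    my $last = pop @$h;
    if (@$h) {
        $h->[0] = $last;
        my $i = 0;
        while (1) {                       # sift the moved item down
            my ($l, $r, $m) = (2 * $i + 1, 2 * $i + 2, $i);
            $m = $l if $l < @$h && $h->[$l][0] > $h->[$m][0];
            $m = $r if $r < @$h && $h->[$r][0] > $h->[$m][0];
            last if $m == $i;
            @$h[$m, $i] = @$h[$i, $m];
            $i = $m;
        }
    }
    return $top;
}

# Hypothetical helper: keep at most $g words per (lowercased) first letter.
sub sample_by_letter {
    my ($g, @words) = @_;
    my %heap;
    for my $w (@words) {
        my $h = $heap{ lc substr($w, 0, 1) } ||= [];
        heap_push($h, [rand(), $w]);
        heap_pop_max($h) if @$h > $g;     # drop the largest priority
    }
    my %out;
    for my $letter (keys %heap) {
        $out{$letter} = [ map { $_->[1] } @{ $heap{$letter} } ];
    }
    return %out;
}

my %sample = sample_by_letter(2, qw(apple avocado apricot banana blueberry cherry));
print "$_: @{ $sample{$_} }\n" for sort keys %sample;
```

Each word costs O(log g) heap work here, whereas the splice-based scripts above touch the sample only for the lines they actually keep.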

Update 2008 oct 9: see also Randomly select N lines from a file, on the fly.

Update 2009-12-26: see also Random sampling a variable record-length file, which (by the time you look there) should have some good solutions as well.

In reply to Re: improving the efficiency of a script (random sample) by ambrus
in thread improving the efficiency of a script by sulfericacid
