Re: improving the efficiency of a script (random sample)

by ambrus (Abbot)
on Jun 18, 2006 at 20:46 UTC


in reply to improving the efficiency of a script

I show a one-pass solution to this problem using a combinatorial algorithm. Here one-pass means that you need only O(gm) memory if you want to print g words and the maximal word length is m, that you read the file only once, and that you don't need to know the number of words in the dictionary in advance. Beyond this, I make no effort to keep the computation time low. That could also be done easily (while still keeping the previous efficiency conditions true): for that, see algorithm R in chapter 3.4.2 of Knuth, but I leave the implementation as an exercise for the reader.
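For reference, here is a rough, untested sketch of that algorithm (classic reservoir sampling; the variable names are my own):

use warnings; use strict;

my $g = 100;    # desired sample size
my @reservoir;  # the current sample
my $n = 0;      # words read so far

while (my $word = <>) {
    $n++;
    if (@reservoir < $g) {
        # the first g words always enter the sample
        push @reservoir, $word;
    }
    elsif (rand($n) < $g) {
        # keep the n-th word with probability g/n,
        # overwriting a uniformly chosen earlier pick
        $reservoir[int(rand($g))] = $word;
    }
}
print for @reservoir;

The point is that every word survives to the end with the same probability, even though the total word count is unknown while reading.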

You didn't say whether there's any requirement on the order of the words printed, so I assume it can be anything (whatever is simplest to implement). I'll also assume that if there are fewer than 100 words starting with a certain letter, we have to print all of them. The usual disclaimer naturally applies to the code: I put this together fast and it may have errors.

As a simpler example, I first show how to select just 100 words uniformly at random from a dictionary, regardless of first letters.

use warnings; use strict;

my $g = 100;  # sample size
my @c;        # the current sample
my $n = 0;    # words read so far

while (<>) {
    # keep the n-th word with probability g/n: splice inserts it at a
    # random spot, removing one old word only once the sample is full
    if (rand() < $g / ++$n) {
        splice @c, int(rand(@c)), $g <= @c, $_;
    }
}
print for @c;
Now, doing this separately for every first letter, we get this:
use warnings; use strict;

my $g = 100;  # sample size per letter
my %c;        # letter => arrayref holding that letter's sample
my %n;        # letter => number of words seen for that letter

while (<>) {
    my $l = /(.)/ && lc($1);  # first letter of the word
    my $c = \@{$c{$l}};       # autovivify this letter's sample
    # same scheme as above, with a separate counter per letter
    if (rand() < $g / ++$n{$l}) {
        splice @$c, int(rand(@$c)), $g <= @$c, $_;
    }
}
print @$_ for values(%c);

Update. Another one-pass solution would be to use heaps. You create a heap for each letter, add words as you read them to the corresponding heap using a random number as priority, and pop an element whenever a heap grows larger than 100. I guess that this would be less CPU-efficient than the above-mentioned good algorithm in Knuth, even when both are well implemented.
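Here is a rough, untested sketch of that heap variant: per letter, keep the 100 words with the highest random priorities, popping the minimum whenever the heap overflows. The small hand-rolled min-heap and all names are mine; a CPAN heap module would do just as well.

use warnings; use strict;

my $g = 100;  # sample size per letter
my %heap;     # letter => arrayref used as a binary min-heap of [priority, word]

# sift a newly pushed item up until the heap property holds
sub heap_push {
    my ($h, $item) = @_;
    push @$h, $item;
    my $i = $#$h;
    while ($i > 0) {
        my $p = int(($i - 1) / 2);
        last if $h->[$p][0] <= $h->[$i][0];
        @$h[$p, $i] = @$h[$i, $p];
        $i = $p;
    }
}

# remove the minimum-priority item and restore the heap property
sub heap_pop_min {
    my ($h) = @_;
    my $last = pop @$h;
    return $last if !@$h;
    my $min = $h->[0];
    $h->[0] = $last;
    my $i = 0;
    while (1) {
        my ($l, $r) = (2 * $i + 1, 2 * $i + 2);
        my $s = $i;
        $s = $l if $l < @$h && $h->[$l][0] < $h->[$s][0];
        $s = $r if $r < @$h && $h->[$r][0] < $h->[$s][0];
        last if $s == $i;
        @$h[$s, $i] = @$h[$i, $s];
        $i = $s;
    }
    return $min;
}

while (my $word = <>) {
    my ($l) = $word =~ /(.)/ or next;
    my $h = $heap{lc $l} ||= [];
    heap_push($h, [rand(), $word]);  # random number as priority
    heap_pop_min($h) if @$h > $g;    # keep only the g highest priorities
}
print map { $_->[1] } @$_ for values %heap;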

Update 2008 Oct 9: see also Randomly select N lines from a file, on the fly.

Update 2009-12-26: see also Random sampling a variable record-length file, which, by the time you look there, should have some good solutions as well.
