in reply to Estimating Vocabulary

Well, here's an alternative. It's a complete waste of cycles, since it rereads the whole file once per word and so scales linearly with the number of words returned; on the other hand, it isn't bounded by the size of the dictionary, since it never holds the whole word list in memory. As written, it can also return duplicates.
my(@lines, $line);
open(FILE, shift) || die;
until( scalar @lines == $ARGV[0] ){
    seek(FILE, 0, $. = 0);                       # rewind and reset the line counter
    rand($.) < 1 && ($line = $_) while <FILE>;   # perlfaq5 trick: keep each line with probability 1/$.
    push(@lines, $line);
}
print @lines, "wc -l could have told you this is $. words\n";
It's based on "How do I select a random line from a file?" in perlfaq5. I'd be interested to see whether anybody has a better way of extending this algorithm to return multiple entries.
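
For what it's worth, the usual single-pass way to extend the perlfaq5 trick to several lines is reservoir sampling: keep the first k requested lines as the initial pool, then let each later line overwrite a random slot with steadily decreasing probability. What follows is only a sketch along those lines (not code from the thread), taking the file name and the count in the same order as the snippet above:

my ($file, $k) = @ARGV;
open(FILE, $file) || die;
my @lines;
while (<FILE>) {
    if (@lines < $k) {
        push @lines, $_;                      # fill the pool with the first $k lines
    }
    else {
        my $slot = int rand($.);              # random slot in 0 .. $.-1
        $lines[$slot] = $_ if $slot < $k;     # i.e. keep line $. with probability $k/$.
    }
}
print @lines, "wc -l could have told you this is $. words\n";

Single pass, no duplicates, and memory stays proportional to the number of words requested rather than to the size of the dictionary.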

--
perl -pe "s/\b;([st])/'\1/mg"

Replies are listed 'Best First'.
Re: Re: Estimating Vocabulary
by I0 (Priest) on Mar 28, 2002 at 00:41 UTC
    my(@lines, $line);
    open(FILE, shift) || die;
    1 while <FILE>;                  # first pass: count the lines
    $line = $.;
    seek(FILE, 0, $. = 0);           # rewind and reset the line counter
    # second pass: probabilistically keep lines until $ARGV[0] have been picked
    rand($line - $.) < $ARGV[0] - @lines && push(@lines, $_) while <FILE>;
    print @lines, "wc -l could have told you this is $. words\n";
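
    The idea is selection sampling: the first pass only counts the lines, and the second pass keeps each line with a probability driven by how many picks are still needed versus how many lines remain, so it ends up with exactly $ARGV[0] lines and no duplicates, as long as the file has that many. A hypothetical invocation (script name and dictionary path assumed, not from the thread) would be:

        perl pickwords.pl /usr/share/dict/words 10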
      UPDATE: Excellent!

      WAS: That does not appear to work; I ask for one line and get 13-18 lines... It is also heavily weighted towards the Zs.

      --
      perl -pe "s/\b;([st])/'\1/mg"

        Are you sure? It's working for me with only a small bias towards the Zs.

        Update: Apparently, the observed bias was mostly an artifact of small sample size.
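
        One way to check that empirically (a sketch of mine, not from the thread, with toy parameters assumed) is to rerun the same selection loop over an in-memory list many times and tally how often each position is picked; an unbiased sampler should pick every position in about $K/$N of the runs:

            my ($N, $K, $TRIALS) = (26, 1, 100_000);   # toy list size, picks per run, number of runs
            my @hits = (0) x $N;
            for (1 .. $TRIALS) {
                my $chosen = 0;
                for my $i (1 .. $N) {                  # $i plays the role of $. on the second pass
                    if (rand($N - $i) < $K - $chosen) {
                        $hits[$i - 1]++;
                        $chosen++;
                    }
                }
            }
            printf "position %2d: %6d picks (expected ~%d)\n",
                $_ + 1, $hits[$_], $TRIALS * $K / $N for 0 .. $N - 1;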