jpfarmer has asked for the wisdom of the Perl Monks concerning the following question:

I've been messing around with the Benchmark module, experimenting to find the quickest way to approach file reading. I'm currently benchmarking the following code against a 2.5 MB wordlist:
use Benchmark qw/countit cmpthese/;
sub run($) { countit(1, @_) }
cmpthese {
    read_proc => run q{
        open(WORDS,"words.txt") or die("Wordlist unavaliable.\n");
        my @words = <WORDS>;
        close(WORDS);
        foreach $word (@words){
            chomp $word;
            if ($word =~ m/[aeiouyAEIOUY]{4,}/){
                push(@hitwords,$word);
                $hitcounter++;
            }
            $counter++;
        }
    },
    for_proc => run q{
        open(WORDS,"words.txt") or die("Wordlist unavaliable.\n");
        foreach $word (<WORDS>){
            chomp $word;
            if ($word =~ m/[aeiouyAEIOUY]{4,}/){
                push(@hitwords,$word);
                $hitcounter++;
            }
            $counter++;
        }
        close(WORDS);
    },
    while_proc => run q{
        open(WORDS,"words.txt") or die("Wordlist unavaliable.\n");
        while($word = <WORDS>){
            chomp $word;
            if ($word =~ m/[aeiouyAEIOUY]{4,}/){
                push(@hitwords,$word);
                $hitcounter++;
            }
            $counter++;
        }
        close(WORDS);
    }
};
I would expect while_proc to be the fastest by far, and the entire chunk of code should take only a few minutes to run. However, when I run it, it takes hours and then gives me this result:
           s/iter  for_proc read_proc while_proc
for_proc     5983        --       -6%      -100%
read_proc    5645        6%        --      -100%
while_proc   2.42   246713%   232775%         --
Now I know that can't be right. Each block of code runs fine by itself, and while_proc IS the fastest version of the code. But the Benchmark results don't verify that at all. What am I doing wrong? I know I need to run more iterations for a reliable benchmark, but I can't really do that if one takes half the day.

Re: Benchmarking File Retrevial
by demerphq (Chancellor) on Dec 15, 2002 at 21:35 UTC
    Hi. Just a thought, but shouldn't you be reinitializing your variables before each test pass? Specifically @hitwords. Otherwise, if the test data size is at all significant, you'll end up holding multiple copies of it in memory. This could lead to excessive swapping and the like.
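To illustrate the point about reinitializing, here is a minimal sketch. The two subs are hypothetical stand-ins for one benchmark pass each (they are not from the original post); one pushes onto a package global the way the original snippets do, the other onto a fresh lexical.

```perl
use strict;
use warnings;

our @hitwords;   # package global, as in the original snippets

# Hypothetical stand-ins for a single benchmark pass.
sub pass_with_global {
    push @hitwords, 'aeiou' for 1 .. 3;   # never reset between passes
    return scalar @hitwords;
}

sub pass_with_lexical {
    my @hits;                             # fresh array on every pass
    push @hits, 'aeiou' for 1 .. 3;
    return scalar @hits;
}

# Record the array size after each of three passes.
my @global_sizes  = map { pass_with_global()  } 1 .. 3;
my @lexical_sizes = map { pass_with_lexical() } 1 .. 3;

print "global:  @global_sizes\n";    # global:  3 6 9
print "lexical: @lexical_sizes\n";   # lexical: 3 3 3
```

The global array keeps growing with every pass, which is the kind of build-up that could matter once Benchmark repeats a block many times over a large wordlist.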

    Also, I don't understand your use of the sub run. cmpthese will execute countit anyway, so I don't see the purpose at all. A last point is that

    while($word = <WORDS>){
    rings a bell somewhere. It's not the same thing, IIRC, as saying
    while (<WORDS>) {
    Although I could be wrong. I can't remember where this opinion comes from.

    I would have expected your benchmark to look more like:

    use Benchmark qw/cmpthese/;
    cmpthese 1,{
        read_proc => <<'EOFCODE',
            open(my $words,"words.txt") or die("Wordlist unavaliable.\n");
            my $hitcounter=0;
            my @hitwords;
            my $counter=0;
            my @words = <$words>;
            close($words);
            foreach my $word (@words){
                chomp $word;
                if ($word =~ m/[aeiouyAEIOUY]{4,}/){
                    push(@hitwords,$word);
                    $hitcounter++;
                }
                $counter++;
            }
    EOFCODE
        for_proc => <<'EOFCODE',
            open(my $words,"words.txt") or die("Wordlist unavaliable.\n");
            my $hitcounter=0;
            my @hitwords;
            my $counter=0;
            foreach my $word (<$words>){
                chomp $word;
                if ($word =~ m/[aeiouyAEIOUY]{4,}/){
                    push(@hitwords,$word);
                    $hitcounter++;
                }
                $counter++;
            }
            close($words);
    EOFCODE
        while_proc => <<'EOFCODE',
            open(my $words,"words.txt") or die("Wordlist unavaliable.\n");
            my $hitcounter=0;
            my @hitwords;
            my $counter=0;
            while(<$words>){
                chomp;
                if (m/[aeiouyAEIOUY]{4,}/){
                    push(@hitwords,$_);
                    $hitcounter++;
                }
                $counter++;
            }
            close($words);
    EOFCODE
    };

    --- demerphq
    my friends call me, usually because I'm late....

      I'm getting my Benchmark syntax from Programming Perl. Here's the example in the book:
      use Benchmark qw/countit cmpthese/;
      sub run($) { countit(5, @_) }
      for $size (2, 200, 20_000) {
          $s = "." x $len;
          print "\nDATASIZE = $size\n";
          cmpthese {
              chop2   => run q{ $t = $s; chop $t; chop $t; },
              subs    => run q{ ($t = $s) =~ s/..\Z//s; },
              lsubstr => run q{ $t = $s; substr($t, -2) = ''; },
              rsubstr => run q{ $t = substr($s, 0, length($s)-2); },
          };
      }
      Reinitializing @hitwords is a good idea. I'll try it. Also, I thought the only difference between while($word = <WORDS>){ and while(<WORDS>){ was that in the latter, the line was stored in $_. I haven't been able to find any documentation to the contrary, although I'd believe there might be a difference.

        Use B::Deparse. I think it may be documented in perlop, but the second construct adds the defined operator.

        $ perl -MO=Deparse
        while (<STDIN>) { print; }

        produces:

        while (defined($_ = <STDIN>)) {
            print $_;
        }
        - syntax OK
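As a small check on the two loop forms discussed above: perlop's section on I/O operators describes a readline assigned in a while condition as being tested for definedness whether the assignment is explicit or automatic. The sketch below uses an in-memory filehandle and sample data that are assumptions for illustration only; the last line is "0" with no trailing newline, a value that is false but defined.

```perl
use strict;
use warnings;

# Sample data (an assumption for this sketch): the final line is "0"
# with no newline, which is false in boolean context but still defined.
my $data = "apple\n0";

# Explicit assignment to a named variable.
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";
my $explicit = 0;
while (my $word = <$fh>) { $explicit++ }
close($fh);

# Bare readline, which assigns to $_.
open($fh, '<', \$data) or die "Cannot open in-memory file: $!";
my $implicit = 0;
while (<$fh>) { $implicit++ }
close($fh);

print "explicit: $explicit, implicit: $implicit\n";   # explicit: 2, implicit: 2
```

Both loops see the trailing "0" line, so for the purposes of this benchmark the practical difference between the two forms is which variable receives the line.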
        Well, unless there's a version issue going on here, I would write that benchmark like this:
        use Benchmark 'cmpthese';
        for $size (2, 200, 20_000) {
            $s = "." x $len;
            print "\nDATASIZE = $size\n";
            cmpthese -5,{
                chop2   => '$t = $s; chop $t; chop $t;',
                subs    => '($t = $s) =~ s/..\Z//s;',
                lsubstr => '$t = $s; substr($t, -2) = "";',
                rsubstr => '$t = substr($s, 0, length($s)-2);',
            };
        }
        There are a few more ways, but this is a direct, less verbose copy of what you posted. The -5 argument indicates that the benchmarking for each item should take at minimum 5 CPU seconds (it may take longer). If you use a positive argument, it does that many runs of the given code instead. It will warn if you don't use enough iterations for it to get a "reasonable" sample.
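The difference between a negative and a positive count can be sketched as follows. This is only an illustration: the data size, the iteration count, and the use of code refs with lexical variables are choices made for this example, not something taken from the posts above.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $s = '.' x 200;   # sample data; the size is an arbitrary choice

my %tests = (
    chop2   => sub { my $t = $s; chop $t; chop $t; },
    rsubstr => sub { my $t = substr($s, 0, length($s) - 2); },
);

# Negative count: run each entry for at least 1 CPU second.
# cmpthese prints the comparison chart and also returns its rows.
my $rows = cmpthese(-1, \%tests);

# Positive count: run each entry exactly that many times, however
# long that takes. Benchmark may warn if this is too few iterations.
cmpthese(1_000_000, \%tests);
```

A negative count is usually the more convenient choice when the cost of one iteration is unknown, since it bounds the run time from below rather than fixing the number of runs.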

        Please consult the Benchmark documentation as I suspect the interface has moved on since the edition of Programming Perl that you are using. I say this because your code does work, but it looks like Benchmark has been updated to do that idiom automatically.

        cheers

        --- demerphq
        my friends call me, usually because I'm late....

      Using the code from Re: Benchmarking File Retrevial, I found that if both read_proc and for_proc are used together, for_proc will run and then read_proc will hang. If I comment either one out, the program runs properly. This behavior occurs under ActivePerl. I tried it under UNIX Perl out of curiosity, and it works properly. Both are the same version of Perl and the same version of Benchmark. Perhaps we've discovered a bug in ActivePerl? Is there some other explanation for this behavior?
Re: Benchmarking File Retrevial
by pfaut (Priest) on Dec 15, 2002 at 18:52 UTC

    Here's what I get with your code. I fed it /usr/share/dict/words.

                 Rate read_proc for_proc while_proc
    read_proc  2.21/s        --     -17%       -39%
    for_proc   2.65/s       20%       --       -26%
    while_proc 3.60/s       63%      36%         --
    --- print map { my ($m)=1<<hex($_)&11?' ':''; $m.=substr('AHJPacehklnorstu',hex($_),1) } split //,'2fde0abe76c36c914586c';