Benchmarking File Retrevial

jpfarmer has asked for the wisdom of the Perl Monks concerning the following question:

I've been messing around with the Benchmark module, trying to find the quickest way to read a file. I'm currently benchmarking the following code against a 2.5 MB wordlist:

use Benchmark qw/countit cmpthese/;

sub run($) { countit(1, @_) }

cmpthese {
    read_proc => run q{
        open(WORDS,"words.txt") or die("Wordlist unavaliable.\n");
        my @words = <WORDS>;
        close(WORDS);

        foreach $word (@words){
            chomp $word;
            if ($word =~ m/[aeiouyAEIOUY]{4,}/){
                push(@hitwords,$word);
                $hitcounter++;
            }
            $counter++;
        }
    },
    for_proc => run q{
        open(WORDS,"words.txt") or die("Wordlist unavaliable.\n");

        foreach $word (<WORDS>){
            chomp $word;
            if ($word =~ m/[aeiouyAEIOUY]{4,}/){
                push(@hitwords,$word);
                $hitcounter++;
            }
            $counter++;
        }
        close(WORDS);
    },
    while_proc => run q{
        open(WORDS,"words.txt") or die("Wordlist unavaliable.\n");

        while($word = <WORDS>){
            chomp $word;
            if ($word =~ m/[aeiouyAEIOUY]{4,}/){
                push(@hitwords,$word);
                $hitcounter++;
            }
            $counter++;
        }
        close(WORDS);
    }

};

I would expect while_proc to be the fastest by far, and the entire chunk of code should take only a few minutes to run. However, when I run it, it takes hours and then gives me this result:

           s/iter   for_proc  read_proc while_proc
for_proc     5983         --        -6%      -100%
read_proc    5645         6%         --      -100%
while_proc   2.42    246713%    232775%         --

Now I know that can't be right. Each block of code runs fine by itself, and while_proc IS the fastest version of the code. But the Benchmark results don't verify that at all. What am I doing wrong? I know I need to run more iterations for a reliable benchmark, but I can't really do that if one takes half the day.


Replies are listed 'Best First'.
Re: Benchmarking File Retrevial by demerphq (Chancellor) on Dec 15, 2002 at 21:35 UTC
Hi. Just a thought, but shouldn't you be reinitializing your variables before each test pass? Specifically `@hitwords`. Otherwise, if the test data size is at all significant, you'll end up holding multiple copies of it in memory. This could lead to excessive swapping and the like.

Also, I don't understand your use of the sub run. cmpthese will execute countit anyway, so I don't see the purpose at all.

A last point is that `while($word = <WORDS>){` rings a bell somewhere. It's not the same thing, iirc, as saying `while (<WORDS>) {`, although I could be wrong. I can't remember where this opinion comes from.

I would have expected your benchmark to look more like:

use Benchmark qw/cmpthese/;

cmpthese 1,{
    read_proc => <<'EOFCODE',
        open(my $words,"words.txt") or die("Wordlist unavaliable.\n");
        my $hitcounter=0;
        my @hitwords;
        my $counter=0;
        my @words = <$words>;
        close($words);

        foreach my $word (@words){
            chomp $word;
            if ($word =~ m/[aeiouyAEIOUY]{4,}/){
                push(@hitwords,$word);
                $hitcounter++;
            }
            $counter++;
        }
EOFCODE
    for_proc => <<'EOFCODE',
        open(my $words,"words.txt") or die("Wordlist unavaliable.\n");
        my $hitcounter=0;
        my @hitwords;
        my $counter=0;

        foreach my $word (<$words>){
            chomp $word;
            if ($word =~ m/[aeiouyAEIOUY]{4,}/){
                push(@hitwords,$word);
                $hitcounter++;
            }
            $counter++;
        }
        close($words);
EOFCODE
    while_proc => <<'EOFCODE',
        open(my $words,"words.txt") or die("Wordlist unavaliable.\n");
        my $hitcounter=0;
        my @hitwords;
        my $counter=0;

        while(<$words>){
            chomp;
            if (m/[aeiouyAEIOUY]{4,}/){
                push(@hitwords,$_);
                $hitcounter++;
            }
            $counter++;
        }
        close($words);
EOFCODE
};

--- demerphq
my friends call me, usually because I'm late....
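The point about reinitializing variables can be seen in isolation with a minimal sketch. It assumes a stand-in snippet that pushes three made-up words per pass instead of reading words.txt; the fully qualified `@main::hitwords` is used only so the string code is unambiguous about which array it touches. A package global that is never reset keeps growing across passes, while a snippet that resets it first ends every pass at the same size.

```perl
use strict;
use warnings;
use Benchmark qw(timethis);

our @hitwords;    # package global, like @hitwords in the original snippets

# Hypothetical stand-in for the word loop: each pass pushes three "hits".
# Nothing resets the array, so it grows on every pass Benchmark makes.
timethis(5, 'push @main::hitwords, qw(queue aioli onomatopoeia);',
         'no reset', 'none');
my $without_reset = scalar @hitwords;
print "without reset: $without_reset elements\n";

# Same work, but the snippet clears the array at the start of each pass,
# so the array never holds more than one pass worth of data.
@hitwords = ();
timethis(5, '@main::hitwords = (); push @main::hitwords, qw(queue aioli onomatopoeia);',
         'reset', 'none');
my $with_reset = scalar @hitwords;
print "with reset:    $with_reset elements\n";
```

Benchmark will likely print its "too few iterations" warning for such a small count; that is expected here, since the sketch is about state carried between passes rather than timing.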
Re: Re: Benchmarking File Retrevial by jpfarmer (Pilgrim) on Dec 15, 2002 at 23:14 UTC
I'm getting my Benchmark syntax from Programming Perl. Here's the example in the book:

use Benchmark qw/countit cmpthese/;
sub run($) { countit(5, @_) }
for $size (2, 200, 20_000) {
    $s = "." x $len;
    print "\nDATASIZE = $size\n";
    cmpthese {
        chop2   => run q{ $t = $s; chop $t; chop $t; },
        subs    => run q{ ($t = $s) =~ s/..\Z//s; },
        lsubstr => run q{ $t = $s; substr($t, -2) = ''; },
        rsubstr => run q{ $t = substr($s, 0, length($s)-2); },
    };
}

Reinitializing @hitwords is a good idea. I'll try it. Also, I thought the only difference between `while($word = <WORDS>){` and `while(<WORDS>){` was that in the latter, the line is stored in `$_`. I haven't been able to find any documentation to the contrary, although I'd believe there might be a difference.
Re: Re: Re: Benchmarking File Retrevial by chromatic (Archbishop) on Dec 15, 2002 at 23:51 UTC
Use B::Deparse. I think it may be documented in perlop, but the second construct adds the `defined` operator.

$ perl -MO=Deparse
while (<STDIN>) { print; }

produces:

while (defined($_ = <STDIN>)) {
    print $_;
}
- syntax OK
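To make the role of `defined` concrete, here is a small sketch using an in-memory filehandle (the three-line data is made up for illustration). As far as I know, perlop documents that the `defined` test is applied whether the assignment in the `while` condition is implicit (to `$_`) or explicit (to a named variable), so both loop forms see a final line of "0". A hand-written truth test does not, and stops early.

```perl
use strict;
use warnings;

# Hypothetical data: the last line is "0" with no trailing newline,
# which is a false value in Perl but still a defined one.
my $data = "alpha\nbeta\n0";

# Form 1: implicit assignment to $_.
open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";
my $implicit = 0;
while (<$fh>) { $implicit++ }
close($fh);

# Form 2: explicit assignment to a named variable.
open($fh, '<', \$data) or die "cannot open in-memory file: $!";
my $explicit = 0;
while (my $word = <$fh>) { $explicit++ }
close($fh);

# Form 3: reading by hand and testing plain truth, with no defined().
open($fh, '<', \$data) or die "cannot open in-memory file: $!";
my $truth_only = 0;
for (;;) {
    my $word = <$fh>;
    last unless $word;    # "0" is false, so the loop ends one line early
    $truth_only++;
}
close($fh);

print "implicit: $implicit, explicit: $explicit, truth test: $truth_only\n";
# prints "implicit: 3, explicit: 3, truth test: 2"
```

Running either `while` form through `perl -MO=Deparse` is a quick way to check what your own perl does with it.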
Re: Re: Re: Benchmarking File Retrevial by demerphq (Chancellor) on Dec 16, 2002 at 00:06 UTC
Well, unless there's a version issue going on here, I would write that benchmark like this:

use Benchmark 'cmpthese';
for $size (2, 200, 20_000) {
    $s = "." x $len;
    print "\nDATASIZE = $size\n";
    cmpthese -5,{
        chop2   => '$t = $s; chop $t; chop $t;',
        subs    => '($t = $s) =~ s/..\Z//s;',
        lsubstr => '$t = $s; substr($t, -2) = "";',
        rsubstr => '$t = substr($s, 0, length($s)-2);',
    };
}

There are a few more ways, but this is a direct but less verbose copy of what you posted. The -5 argument indicates that the benchmarking for each item should take at least 5 CPU seconds (it may take longer). If you use a positive argument, it does that many runs of the given code. It will warn if you don't use enough iterations for it to get a "reasonable" sample.

Please consult the Benchmark documentation, as I suspect the interface has moved on since the edition of Programming Perl that you are using. I say this because your code does work, but it looks like Benchmark has been updated to do that idiom automatically.

cheers

--- demerphq
my friends call me, usually because I'm late....
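A minimal runnable sketch of the negative-count form follows. It assumes a single fixed 200-character input and uses code references instead of strings so that the lexical `$s` is visible to both snippets; the one-second limit is chosen only to keep the run short.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumed input: one fixed size rather than the loop over several sizes.
my $s = '.' x 200;

# Negative count: run each snippet for at least 1 CPU second and let
# Benchmark decide how many iterations that takes.
my $rows = cmpthese(-1, {
    chop2   => sub { my $t = $s; chop $t; chop $t; },
    rsubstr => sub { my $t = substr($s, 0, length($s) - 2); },
});

# cmpthese also returns the comparison table as an array reference:
# one header row followed by one row per snippet.
printf "rows in table: %d\n", scalar @$rows;
```

A positive count such as `cmpthese(100_000, {...})` would instead run each snippet exactly that many times, which is only practical when a single pass is cheap.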
Re: Re: Benchmarking File Retrevial by jpfarmer (Pilgrim) on Dec 16, 2002 at 08:14 UTC
Using the code from Re: Benchmarking File Retrevial, I found that if both read_proc and for_proc are used together, for_proc will run and then read_proc will hang. If I comment either one out, the program runs properly. This behavior is under ActivePerl. I tried it under UNIX Perl out of curiosity, and it works properly. Both are the same version of Perl and the same version of Benchmark. Perhaps we've discovered a bug in ActivePerl? Is there some other explanation for this behavior?
Re: Benchmarking File Retrevial by pfaut (Priest) on Dec 15, 2002 at 18:52 UTC
Here's what I get with your code. I fed it /usr/share/dict/words.

             Rate  read_proc   for_proc while_proc
read_proc  2.21/s         --       -17%       -39%
for_proc   2.65/s        20%         --       -26%
while_proc 3.60/s        63%        36%         --

---
`print map { my ($m)=1<<hex($_)&11?' ':''; $m.=substr('AHJPacehklnorstu',hex($_),1) } split //,'2fde0abe76c36c914586c';`