in reply to Too much SQL not enough perl

You could solve the problem with grep or the approach you outlined in your question, but if you want it to behave the way IN() does in SQL, you should use a hash for performance reasons:

#!/usr/bin/perl -w
use strict;
use Benchmark qw(:all);

my @choices = qw|a b c|;
my %choices = map { $_ => 1 } @choices;    # lookup hash keyed on each choice
my @cData   = <DATA>;

timethese( 100000, {
    'Damn Slow'   => \&parseData1,
    'Much Better' => \&parseData2,
});

# linear scan: grep walks the whole @choices list for every item
sub parseData1 {
    foreach (@cData) {
        chomp;
        my @items = split /,/, $_;    # slow way
        foreach my $item (@items) {
            if ( grep { $item eq $_ } @choices ) {
                #print qq|FOUND $item|;
            }
        }
    }
}

# hash lookup: a single key check per item
sub parseData2 {
    foreach (@cData) {
        chomp;
        foreach my $item ( split /,/, $_ ) {
            if ( $choices{$item} ) {
                #print qq|FOUND $item\n|;
            }
        }
    }
}

__DATA__
z,t,m,u,a,b,c
s,t,l,m,z,a,s
c,b,a,m,u,t,n
k,l,t,s,z,r,t

Produces:

Benchmark: timing 100000 iterations of Damn Slow, Much Better...
  Damn Slow:  9 wallclock secs ( 8.81 usr +  0.00 sys =  8.81 CPU) @ 11348.16/s (n=100000)
Much Better:  4 wallclock secs ( 3.88 usr +  0.00 sys =  3.88 CPU) @ 25806.45/s (n=100000)

Celebrate Intellectual Diversity

Re^2: Too much SQL not enough perl
by EvanCarroll (Chaplain) on Oct 10, 2005 at 04:29 UTC
    I do not think it is fair that you load up your hash before the benchmarking starts; that skews the results.

    Nor is it fair that in the 'slow' one you use a temporary array:
    my @items = split /,/, $_;
    foreach my $item (@items) {

    vs
    foreach my $item (split /,/, $_) {


    Also, if ($choices{$item}) should probably be if (exists $choices{$item}), or it will misbehave whenever a key's value is 0, an empty string, or undef.
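
    A quick illustration of the difference, using a hypothetical %choices whose values can be false (not the hash benchmarked above, where every value is 1):

    my %choices = ( a => 1, b => 0, c => undef );
    print "truthy test: ", (        $choices{b} ? "found" : "missed" ), "\n";   # missed -- 0 is false
    print "exists test: ", ( exists $choices{b} ? "found" : "missed" ), "\n";   # found  -- the key is there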

    UPDATE:

    Nor is grep a good idea; quoting a passage I remember reading in perldoc perlfaq (perldoc -q unique):
    These are slow (grep) (checks every element even if the first matches), inefficient (same reason), and potentially buggy
    But, granted, it still says:
    Hearing the word "in" is an indication that you probably should have used a hash, not a list or array, to store your data. Hashes are designed to answer this question quickly and efficiently. Arrays aren’t.
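
    (The "potentially buggy" part mostly refers to testing membership with a pattern match instead of string equality; a small sketch of my own, not taken from the benchmark above:)

    my @choices = qw(ab bc cd);
    my $item    = 'a';
    print "pattern grep matched\n" if grep { /$item/ }     @choices;   # fires: 'ab' =~ /a/ is a false positive
    print "eq grep matched\n"      if grep { $_ eq $item } @choices;   # never fires: no exact 'a' in @choices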


    Evan Carroll
    www.EvanCarroll.com
      Note that with a little skullduggery, you can make grep short-circuit. Granted, it's still much more clear to use List::Util 'first', and for searching the same candidate list many times, it's more efficient to use a hash.
      my @candidates = qw(z y a b c a d a e a f);
      foreach my $question ('a', 'm') {
          if ( do {{; grep { $_ eq $question ? do { print "Match\n"; last } : 0 } @candidates }} ) {
              print "Found $question\n";
          }
      }
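
      For comparison, the List::Util approach mentioned above would look something like this (a minimal sketch reusing the same @candidates):

      use List::Util 'first';

      my @candidates = qw(z y a b c a d a e a f);
      foreach my $question ('a', 'm') {
          if ( defined( first { $_ eq $question } @candidates ) ) {
              print "Found $question\n";
          }
      }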

      Caution: Contents may have been coded under pressure.

        Of course that really just uses grep for its side effect, and you could, a little less twisted-brainedly, say something like

        foreach my $question ('a', 'm') {
            if ( sub { $_ eq $question and return 1 for @candidates; 0; }->() ) {
                print "Found $question\n";
            }
        }

        which is basically an inlined implementation of List::Util’s first.

        Makeshifts last the longest.

      Nor is it fair that in the 'slow' one you use a temporary array...

      EvanCarroll is right about the slight differences between the routines, so I commented a few things out and made both subs use the same temporary array (note: I also increased the number of elements to look for and the number of iterations, for more meaningful results):

      #!/usr/bin/perl -w
      use strict;
      use Benchmark qw(:all);

      my @choices = qw|a b c z m p t l c f g|;
      my %choices = map { $_ => 1 } @choices;
      my @cData   = <DATA>;

      timethese( 50_000, {
          'Damn Slow'   => \&parseData1,
          'Much Better' => \&parseData2,
      });

      sub parseData1 {
          foreach (@cData) {
              chomp;
              my @items = split /,/, $_;    # slow way
              foreach my $item (@items) {
                  if ( grep { $item eq $_ } @choices ) {
                      #print qq|FOUND $item|;
                  }
              }
          }
      }

      sub parseData2 {
          foreach (@cData) {
              chomp;
              # foreach my $item (split /,/, $_) {
              my @items = split /,/, $_;    # slow way
              foreach my $item (@items) {
                  if ( $choices{$item} ) {
                      #print qq|FOUND $item\n|;
                  }
              }
          }
      }

      __DATA__
      z,t,m,u,a,b,c
      s,t,l,m,z,a,s
      c,b,a,m,u,t,n
      k,l,t,s,z,r,t

      Which still produces:

      perl seediff.pl
      Benchmark: timing 50000 iterations of Damn Slow, Much Better...
        Damn Slow:  6 wallclock secs ( 6.12 usr +  0.00 sys =  6.12 CPU) @  8172.61/s (n=50000)
      Much Better:  3 wallclock secs ( 2.96 usr +  0.00 sys =  2.96 CPU) @ 16874.79/s (n=50000)

      I think the above changes were 'fair', since the performance of something like SQL's IN() in an actual database should not degrade noticeably as the number of elements grows.
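
      To see that scaling behaviour directly, here is a rough sketch of my own (not from the thread) comparing a small choice list against a much larger one; the hash lookups stay roughly constant while the grep version slows down as the list grows:

      use strict;
      use warnings;
      use Benchmark qw(cmpthese);

      my @small = ('a' .. 'e');      #   5 choices
      my @large = ('aa' .. 'zz');    # 676 choices
      my %small = map { $_ => 1 } @small;
      my %large = map { $_ => 1 } @large;
      my @items = qw(q a zz m c);

      cmpthese( -1, {
          'hash,   5 keys'  => sub { for my $i (@items) { my $hit = exists $small{$i} } },
          'hash, 676 keys'  => sub { for my $i (@items) { my $hit = exists $large{$i} } },
          'grep,   5 elems' => sub { for my $i (@items) { my $hit = grep { $i eq $_ } @small } },
          'grep, 676 elems' => sub { for my $i (@items) { my $hit = grep { $i eq $_ } @large } },
      });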

      Celebrate Intellectual Diversity