vivomancer has asked for the wisdom of the Perl Monks concerning the following question:

The following code works on my current version of Perl. The array @taxR could have 4,000 entries, and there will be about 5 million different values of $curEntry. The purpose is to check whether the taxonomic code of the protein currently being read is included in an array of allowed taxonomic codes.

    my @taxR = ("PLRV1", "PMTVS", "PVXHB");
    my $curEntry = "PMTVS";
    if ($curEntry ~~ @taxR) {
        print "do rest of stuff";
    }

With this code my entire program takes about 20 seconds to run on my test data set and 30 minutes on the real thing. I've tried this

    use List::Util qw(first);   # first is not a builtin; it must be imported

    my @taxR = ("PLRV1", "PMTVS", "PVXHB");
    my $curEntry = "PMTVS";
    if ( first { $_ eq $curEntry } @taxR ) {
        print "do rest of stuff";
    }

but the test data takes 3 minutes to run, so the real set would be unusably long. I have draconian IT guys who will never agree to upgrade Perl on the Macs, which run version 5.8.8, so I was hoping you could help me find a replacement method that doesn't take a million years to run.

Re: Need a replacement method for older version of perl
by toolic (Bishop) on Jun 26, 2012 at 21:03 UTC
    Hash look-ups will probably speed things up:

    use warnings;
    use strict;

    my @taxR = ("PLRV1", "PMTVS", "PVXHB");
    my %taxRh = map { $_ => 1 } @taxR;
    my $curEntry = "PMTVS";
    if (exists $taxRh{$curEntry}) {
        print "do rest of stuff";
    }

      Thanks, that's much more logical than what I was doing. Though it turns out that wasn't what was causing my program to break. I'm going to make a new thread, since it's going to be different enough from my topic title.
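To put the suggestion in numbers: scanning a 4,000-element array for each of 5 million entries is up to 4,000 x 5,000,000 = 20 billion string comparisons in the worst case, while a hash needs one probe per entry. The following is only a minimal sketch of how the lookup might sit inside the main loop. The helper name count_allowed and the small @entries list are hypothetical stand-ins, not part of the original program. It uses nothing that is unavailable in Perl 5.8.8.

```perl
use strict;
use warnings;

# Allowed taxonomic codes, taken from the question (the real list has ~4,000).
my @taxR = ("PLRV1", "PMTVS", "PVXHB");

# Build the lookup hash once, before the main loop.
my %allowed = map { $_ => 1 } @taxR;

# Hypothetical helper: count how many entries carry an allowed code.
sub count_allowed {
    my ($allowed, @entries) = @_;
    my $count = 0;
    for my $curEntry (@entries) {
        next unless exists $allowed->{$curEntry};   # one hash probe per entry
        $count++;                                   # "do rest of stuff" would go here
    }
    return $count;
}

# Hypothetical stand-in for the ~5 million entries read from the real data.
my @entries = ("PMTVS", "XXXXX", "PLRV1", "PMTVS");
print count_allowed(\%allowed, @entries), " of ", scalar(@entries), " entries allowed\n";
```

The important design point is that the map runs once, outside the loop; rebuilding the hash for every entry would cost more than the original array scan.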
Re: Need a replacement method for older version of perl
by tobyink (Canon) on Jun 26, 2012 at 21:06 UTC

    Using first might not be a great idea. If $curEntry is the empty string "", and @taxR does actually contain an empty string (so you should expect a match), then first will end up returning false.

    The any function from List::MoreUtils is probably a better choice.
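The pitfall described above can be reproduced with a short sketch. It relies only on List::Util, which ships with Perl 5.8.8. The contains helper is a hypothetical, hand-rolled stand-in for any-like behaviour, written out here so the example does not depend on List::MoreUtils being installed.

```perl
use strict;
use warnings;
use List::Util qw(first);

my @codes  = ("PLRV1", "", "PVXHB");   # note the empty string in the list
my $wanted = "";

# first returns the matching element itself; here that is "", which is false.
my $hit = first { $_ eq $wanted } @codes;
print "first in boolean context: ", ($hit ? "match" : "no match"), "\n";

# Hypothetical helper with any-like semantics: returns 1 or 0, never the element.
sub contains {
    my ($needle, @haystack) = @_;
    for my $item (@haystack) {
        return 1 if $item eq $needle;
    }
    return 0;
}

print "contains: ", (contains($wanted, @codes) ? "match" : "no match"), "\n";
```

The same problem applies to a code of "0", which is also false in boolean context.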

    As to your question, you might get better performance from a hash:

    my @taxR = ("PLRV1", "PMTVS", "PVXHB");

    # Copy the array into a hash.
    # Make sure you only do this once.
    my %taxR = map {$_ => 1} @taxR;

    my $curEntry = "PMTVS";
    if (exists $taxR{$curEntry}) {
        print "do rest of stuff";
    }

    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
      Thank you, that's much more logical than what I was doing. Though it turns out that wasn't what was causing my program to break. I'm going to make a new thread, since it's going to be different enough from my topic title.
Re: Need a replacement method for older version of perl
by kcott (Archbishop) on Jun 27, 2012 at 07:09 UTC

    In this sort of situation, you'll want to aim for the code inside the 5,000,000 iterations to be as minimal as possible.

    I ran a few commandline tests comparing the smartmatch with a regex. The regex was 5-10 times faster. Here's a typical run:

    $ time perl -Mstrict -Mwarnings -E '
        my @x = ((q{AXXX}) x 4000, qw{BXXX CXXX});
        my $y = q{BXXX};
        my $c = 0;
        for (1 .. 5000) {
            $y ~~ @x && ++$c;
        }
        say qq{count=$c};
    '
    count=5000

    real    0m1.128s
    user    0m1.122s
    sys     0m0.004s

    $ time perl -Mstrict -Mwarnings -e '
        my @x = ((q{AXXX}) x 4000, qw{BXXX CXXX});
        my $y = q{BXXX};
        my $c = 0;
        my $z = join q{|} => @x;
        for (1 .. 5000) {
            $z =~ m{\b$y\b} && ++$c;
        }
        print qq{count=$c\n};
    '
    count=5000

    real    0m0.142s
    user    0m0.138s
    sys     0m0.003s

    See also: Benchmark

    -- Ken
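As a follow-up to the Benchmark pointer, here is a minimal sketch of how the approaches from this thread might be compared with the core Benchmark module on Perl 5.8.8. The sub names by_first, by_regex, and by_hash are hypothetical, and the data mirrors the command-line test above. The regex version adds \Q...\E as a precaution, on the assumption that real codes could contain regex metacharacters. No timings are shown because they depend on the machine.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use List::Util qw(first);

# Same shape as the command-line test: 4,000 filler codes, target near the end.
my @x = (("AXXX") x 4000, qw(BXXX CXXX));
my $y = "BXXX";

my %h = map { $_ => 1 } @x;   # hash built once, outside the timed code
my $z = join "|", @x;         # joined string for the regex approach

# Linear scan with List::Util::first; defined() avoids the false-value pitfall.
sub by_first {
    my $hit = first { $_ eq $y } @x;
    return defined $hit ? 1 : 0;
}

# Regex search over the joined string; \Q...\E quotes any metacharacters.
sub by_regex { return $z =~ m{\b\Q$y\E\b} ? 1 : 0 }

# Single hash probe.
sub by_hash { return exists $h{$y} ? 1 : 0 }

# Run each sub for at least 1 CPU second and print a comparison table.
cmpthese(-1, {
    first => \&by_first,
    regex => \&by_regex,
    hash  => \&by_hash,
});
```

Keeping the hash and the joined string outside the timed subs matters: it measures the per-entry cost, which is what is repeated 5 million times in the real program.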

Re: Need a replacement method for older version of perl
by Anonymous Monk on Jun 27, 2012 at 00:13 UTC
    It is very easy in Perl to write a little bit of code that does something in a very inefficient way. What you are asking the computer to do is to iterate sequentially through up to 4,000 records, 5 million times. You do the math. What is plowing you under the ground is virtual-memory overhead,