in reply to how can I speed up this perl??

Do you happen to come from a C background ? In perl it's almost never a good idea to process a string as an array of characters. Building that array takes time and a lot of memory, and the operations you then do on it are rarely natural or fast (ok, a simple character walk is fast).

So the best way to solve this is to actually have the sequence in a plain string, and walk that string in a perlish way. Here substr() seems a good operator (you can also use a pattern match that selects 2 chars at a time, but that turns out to be slower).

Making perl fast is also to a great extent reducing the amount of opcodes that get executed. So your multiple if tests want to be replaced by something that takes less operations. As pointed out by the other answers, you can use a hash here. While a hash lookup is slightly more work than a simple test, it only has to be done once.

So the code becomes:

my %count; $count{substr($genome, $_, 2)}++ for 0..length($genome)-2; # Next combine the counts like in the original code # is this a bug or intentional ? $count{tt} += $count{aa}; $count{ag} += $count{ct}; $count{ac} += $count{gt}; $count{tg} += $count{ca}; $count{ga} += $count{tc}; $count{cc} += $count{gg};
This is about 4 times as fast as an array based solution using a hash on my perl.

If this still isn't fast enough, you can start looking at things like Inline::C.

Replies are listed 'Best First'.
using s/// as map?
by sleepingsquirrel (Chaplain) on Nov 24, 2003 at 18:31 UTC
    Yeah, I also like the idea of using strings instead of arrays. When I read the problem, my first thought was to try something like...
    $genome=~s/(.)(?=(.))/++$count{$1.$2} and undef/eg;
    ...does anyone else tend to (ab)use the replacement operator as sort of a map which works on strings?
      Why do the lookahead? Why not just
      $genome =~ s/(..)/++$count{$1} and undef/eg;

      ------
      We are the carpenters and bricklayers of the Information Age.

      The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

      ... strings and arrays will suffice. As they are easily available as native data types in any sane language, ... - blokhead, speaking on evolutionary algorithms

      Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

        I'm assuming that the original poster wanted the frequency of all pairs, so we don't want the regex engine to consume more than one character per iteration. With...
        $genome="acgt";
        Then the look ahead version will finish with...
        $count{ac} == 1 $count{cg} == 1 #Note this match $count{gt} == 1
        ...while the non-lookahead version will have...
        $count{ac} == 1 $count{gt} == 1
        ... i.e. it misses the "cg" case because it jumps through the string two characters at a time.
      I use it a lot in perlgolf, but not in real code since it tends to destroy the string (notice that your code does, though you tried to avoid that I think). In real code using a while on a regex in scalar context is slightly faster and not so dangerous. When programming an inner loop and trying to be blazingly fast, you also should avoid things like $1.$2 since constructing a new value is a relatively expensive operation in perl. So the regex variant I'd use is:
      $count{$1}++ while $genome =~ /(?=(..))./g
      But this is still twice as slow as the substr() variant.
        This should be a little better.
        ++$count{$1} while $genome =~ /\G(..)/g;
        The PerlMonks advocate for tr///

      You could also use unpack

      print unpack '(A2X)*', 'abcdefghijklmnopqrstuvwxyz'; ab bc cd de ef fg gh hi ij jk kl lm mn no op pq qr rs st tu uv vw wx x +y yz z

      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
      Hooray!
      Wanted!

      No, and for good reason - its very slow compared to a normal solution. That "e" stands for "eval string", which is pretty sluggish, and should only be used for good reason. Compare:

      use Benchmark; use strict; our $dummy = 0; our %count; my $gen = ""; foreach my $l (qw|a t c g|) { $gen .= "$l$_" for qw|a t c g|; } my @gen = split '', $gen; my %tests; foreach my $test ( qw|chain subst hash_array hash_string| ) { no strict refs; $tests{$test} = sub { *{$test}{CODE}->($gen,\@gen) } } timethese(100000, \%tests); sub chain { my ($gen,$genome) = @_; for my $i (0..$#$genome) { if (($genome->[$i] eq 'a') && ($genome->[$i+1] eq 'a')) { + ++$dummy; } elsif (($genome->[$i] eq 'a') && ($genome->[$i+1] eq 'g')) { + ++$dummy; } elsif (($genome->[$i] eq 'a') && ($genome->[$i+1] eq 'c')) { + ++$dummy; } elsif (($genome->[$i] eq 'a') && ($genome->[$i+1] eq 't')) { + ++$dummy; } elsif (($genome->[$i] eq 't') && ($genome->[$i+1] eq 'a')) { + ++$dummy; } elsif (($genome->[$i] eq 't') && ($genome->[$i+1] eq 'g')) { + ++$dummy; } elsif (($genome->[$i] eq 't') && ($genome->[$i+1] eq 'c')) { + ++$dummy; } elsif (($genome->[$i] eq 't') && ($genome->[$i+1] eq 't')) { + ++$dummy; } elsif (($genome->[$i] eq 'c') && ($genome->[$i+1] eq 'a')) { + ++$dummy; } elsif (($genome->[$i] eq 'c') && ($genome->[$i+1] eq 'g')) { + ++$dummy; } elsif (($genome->[$i] eq 'c') && ($genome->[$i+1] eq 'c')) { + ++$dummy; } elsif (($genome->[$i] eq 'c') && ($genome->[$i+1] eq 't')) { + ++$dummy; } elsif (($genome->[$i] eq 'g') && ($genome->[$i+1] eq 'a')) { + ++$dummy; } elsif (($genome->[$i] eq 'g') && ($genome->[$i+1] eq 'g')) { + ++$dummy; } elsif (($genome->[$i] eq 'g') && ($genome->[$i+1] eq 'c')) { + ++$dummy; } elsif (($genome->[$i] eq 'g') && ($genome->[$i+1] eq 't')) { + ++$dummy; } } } sub subst { my ($gen,$genome) = @_; $gen=~s/(.)(?=(.))/++$count{$1.$2} and undef/eg; } sub hash_array { my ($gen,$genome) = @_; $count { $genome -> [$_] . $genome -> [$_ + 1] } ++ for (0..$#$gen +ome); } sub hash_string { my ($gen,$genome) = @_; $count { substr($gen, $_, 2) }++ for (0..length($gen)-2); } __DATA__ Benchmark: timing 100000 iterations of chain, hash_array, hash_string, + subst... chain: 64 wallclock secs (41.09 usr + 0.00 sys = 41.09 CPU) @ 2 +433.68/s (n=100000) hash_array: 14 wallclock secs (11.24 usr + 0.00 sys = 11.24 CPU) @ 8 +896.80/s (n=100000) hash_string: 8 wallclock secs ( 6.26 usr + 0.00 sys = 6.26 CPU) @ 1 +5974.44/s (n=100000) subst: 17 wallclock secs (16.60 usr + 0.00 sys = 16.60 CPU) @ 6 +024.10/s (n=100000)

        Erm. Did you post the wrong version of your code?


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
        Hooray!
        Wanted!

      I tend to love to do it! (And I do not consider it abuse... I once wrote a brainf*ck interpretter that was just one big s/.../.../eg. Now who could call that abuse?!? ;-D)

      ------------
      :Wq
      Not an editor command: Wq
Inline::C
by sleepingsquirrel (Chaplain) on Nov 25, 2003 at 06:19 UTC
    I'd thought I'd whip up a little test to compare the perl hash based solution to an Inline-C solution. Looks like the C version is about 85x faster...
    Benchmark: timing 20 iterations of hash_string, inline...
    hash_string: 37 wallclock secs (37.08 usr +  0.01 sys = 37.09 CPU) @  0.54/s (n=20)
        inline:  1 wallclock secs ( 0.44 usr +  0.00 sys =  0.44 CPU) @ 45.45/s (n=20)
    #!/usr/bin/perl use Inline C; use Benchmark; my $gen = "atgcgc"x500000; #3 million characters $tests{"inline"} = sub { string_inline_c($gen, length($gen)) }; $tests{"hash_string"} = sub { hash_string($gen) }; timethese(20, \%tests); sub hash_string { my ($genome) = @_; my %count; $count{ substr($genome, $_, 2) }++ for (0..length($genome)-2); } __END__ __C__ int string_inline_c(char *genome, int len) { int i; int hash[96]; /* The hashing function is simply 4*(first char - 'a') + second ch +ar - 'a' */ /* i.e. the bucket for gg is 4*('g'-'a')+'g'-'a' = 30 */ /*initialize hash buckets which will get used*/ /*aa*/ /*ac*/ /*ag*/ /*at*/ hash[ 0] = hash[ 2] = hash[ 6] = hash[19] = 0; /*ca*/ /*cc*/ /*cg*/ /*ct*/ hash[ 8] = hash[10] = hash[14] = hash[27] = 0; /*ga*/ /*gc*/ /*gg*/ /*gt*/ hash[24] = hash[26] = hash[30] = hash[43] = 0; /*ta*/ /*tc*/ /*tg*/ /*tt*/ hash[76] = hash[78] = hash[82] = hash[95] = 0; for(i=0;i<len-1;i++) { hash[4*(genome[i]-'a')+(genome[i+1]-'a')]++; } /* returning the proper perl hash is left as an */ /* exercise for the reader */ /* see also the Inline-C Cookbook */ return(1); }
      Just thought I finish off the code by actually returning the hash back to perl...
      #!/usr/bin/perl use Inline C; use Benchmark; my $gen = "atgcgc"x500000; #3 million characters my $h_ref; $tests{"inline"} = sub { $h_ref = string_inline_c($gen, length($gen)) +}; $tests{"hash_string"} = sub { hash_string($gen) }; timethese(2, \%tests); sub hash_string { my ($genome) = @_; my %count; $count{ substr($genome, $_, 2) }++ for (0..length($genome)-2); } __END__ __C__ SV* string_inline_c(char *genome, int len) { int i; int hash[96]; HV* perl_hash=newHV(); /* The hashing function is simply 4*(first char - 'a') + second ch +ar - 'a' */ /* i.e. the bucket for gg is 4*('g'-'a')+'g'-'a' = 30 */ /*initialize our 'C' hash buckets which will get used*/ /*aa*/ /*ac*/ /*ag*/ /*at*/ hash[ 0] = hash[ 2] = hash[ 6] = hash[19] = 0; /*ca*/ /*cc*/ /*cg*/ /*ct*/ hash[ 8] = hash[10] = hash[14] = hash[27] = 0; /*ga*/ /*gc*/ /*gg*/ /*gt*/ hash[24] = hash[26] = hash[30] = hash[43] = 0; /*ta*/ /*tc*/ /*tg*/ /*tt*/ hash[76] = hash[78] = hash[82] = hash[95] = 0; for(i=0;i<len-1;i++) { hash[4*(genome[i]-'a')+(genome[i+1]-'a')]++; } /*move our values over from the 'C' hash to the perl hash*/ #define h(c,i) (hv_store(perl_hash, (c), sizeof((c))-1, newSViv(hash[( +i)]), 0)) h("aa", 0); h("ac", 2); h("ag", 6); h("at",19); h("ca", 8); h("cc",10); h("cg",14); h("ct",27); h("ga",24); h("gc",26); h("gg",30); h("gt",43); h("ta",76); h("tc",78); h("tg",82); h("tt",95); return newRV_noinc((SV*) perl_hash); /*return a ref to a hash*/ }