Re: how can I speed up this perl??

Replies are listed 'Best First'.
using s/// as map? by sleepingsquirrel (Chaplain) on Nov 24, 2003 at 18:31 UTC
Yeah, I also like the idea of using strings instead of arrays. When I read the problem, my first thought was to try something like... `$genome=~s/(.)(?=(.))/++$count{$1.$2} and undef/eg;` [download] ...does anyone else tend to (ab)use the replacement operator as sort of a map which works on strings?	[reply] [d/l]
Re: using s/// as map? by dragonchild (Archbishop) on Nov 24, 2003 at 18:45 UTC
Why do the lookahead? Why not just `$genome =~ s/(..)/++$count{$1} and undef/eg;` [download] ------ We are the carpenters and bricklayers of the Information Age. The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6 ... strings and arrays will suffice. As they are easily available as native data types in any sane language, ... - blokhead, speaking on evolutionary algorithms Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.	[reply] [d/l]
Re: Re: using s/// as map? by sleepingsquirrel (Chaplain) on Nov 24, 2003 at 19:31 UTC
I'm assuming that the original poster wanted the frequency of all pairs, so we don't want the regex engine to consume more than one character per iteration. With... `$genome="acgt";` [download] Then the look ahead version will finish with... `$count{ac} == 1 $count{cg} == 1 #Note this match $count{gt} == 1` [download] ...while the non-lookahead version will have... `$count{ac} == 1 $count{gt} == 1` [download] ... i.e. it misses the "cg" case because it jumps through the string two characters at a time.	[reply] [d/l] [select]
Re: using s/// as map? by thospel (Hermit) on Nov 24, 2003 at 19:24 UTC
I use it a lot in perlgolf, but not in real code since it tends to destroy the string (notice that your code does, though you tried to avoid that I think). In real code using a while on a regex in scalar context is slightly faster and not so dangerous. When programming an inner loop and trying to be blazingly fast, you also should avoid things like $1.$2 since constructing a new value is a relatively expensive operation in perl. So the regex variant I'd use is: `$count{$1}++ while $genome =~ /(?=(..))./g` [download] But this is still twice as slow as the substr() variant.	[reply] [d/l]
Re: Re: using s/// as map? by Roy Johnson (Monsignor) on Nov 24, 2003 at 19:29 UTC
This should be a little better. `++$count{$1} while $genome =~ /\G(..)/g;` [download] The PerlMonks advocate for tr///	[reply] [d/l]
Re^3: using s/// as map? by thospel (Hermit) on Nov 24, 2003 at 19:46 UTC
Re^3a: using s/// as map? by Roy Johnson (Monsignor) on Nov 24, 2003 at 20:05 UTC
Re: using s/// as map? by BrowserUk (Patriarch) on Nov 24, 2003 at 21:17 UTC
You could also use unpack `print unpack '(A2X)*', 'abcdefghijklmnopqrstuvwxyz'; ab bc cd de ef fg gh hi ij jk kl lm mn no op pq qr rs st tu uv vw wx x +y yz z` [download] Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail Hooray! Wanted!	[reply] [d/l]
Re: using s/// as map? by jryan (Vicar) on Nov 24, 2003 at 23:10 UTC
No, and for good reason - its very slow compared to a normal solution. That "e" stands for "eval string", which is pretty sluggish, and should only be used for good reason. Compare: use Benchmark; use strict; our $dummy = 0; our %count; my $gen = ""; foreach my $l (qw\|a t c g\|) { $gen .= "$l$_" for qw\|a t c g\|; } my @gen = split '', $gen; my %tests; foreach my $test ( qw\|chain subst hash_array hash_string\| ) { no strict refs; $tests{$test} = sub { *{$test}{CODE}->($gen,\@gen) } } timethese(100000, \%tests); sub chain { my ($gen,$genome) = @_; for my $i (0..$#$genome) { if (($genome->[$i] eq 'a') && ($genome->[$i+1] eq 'a')) { + ++$dummy; } elsif (($genome->[$i] eq 'a') && ($genome->[$i+1] eq 'g')) { + ++$dummy; } elsif (($genome->[$i] eq 'a') && ($genome->[$i+1] eq 'c')) { + ++$dummy; } elsif (($genome->[$i] eq 'a') && ($genome->[$i+1] eq 't')) { + ++$dummy; } elsif (($genome->[$i] eq 't') && ($genome->[$i+1] eq 'a')) { + ++$dummy; } elsif (($genome->[$i] eq 't') && ($genome->[$i+1] eq 'g')) { + ++$dummy; } elsif (($genome->[$i] eq 't') && ($genome->[$i+1] eq 'c')) { + ++$dummy; } elsif (($genome->[$i] eq 't') && ($genome->[$i+1] eq 't')) { + ++$dummy; } elsif (($genome->[$i] eq 'c') && ($genome->[$i+1] eq 'a')) { + ++$dummy; } elsif (($genome->[$i] eq 'c') && ($genome->[$i+1] eq 'g')) { + ++$dummy; } elsif (($genome->[$i] eq 'c') && ($genome->[$i+1] eq 'c')) { + ++$dummy; } elsif (($genome->[$i] eq 'c') && ($genome->[$i+1] eq 't')) { + ++$dummy; } elsif (($genome->[$i] eq 'g') && ($genome->[$i+1] eq 'a')) { + ++$dummy; } elsif (($genome->[$i] eq 'g') && ($genome->[$i+1] eq 'g')) { + ++$dummy; } elsif (($genome->[$i] eq 'g') && ($genome->[$i+1] eq 'c')) { + ++$dummy; } elsif (($genome->[$i] eq 'g') && ($genome->[$i+1] eq 't')) { + ++$dummy; } } } sub subst { my ($gen,$genome) = @_; $gen=~s/(.)(?=(.))/++$count{$1.$2} and undef/eg; } sub hash_array { my ($gen,$genome) = @_; $count { $genome -> [$_] . $genome -> [$_ + 1] } ++ for (0..$#$gen +ome); } sub hash_string { my ($gen,$genome) = @_; $count { substr($gen, $_, 2) }++ for (0..length($gen)-2); } __DATA__ Benchmark: timing 100000 iterations of chain, hash_array, hash_string, + subst... chain: 64 wallclock secs (41.09 usr + 0.00 sys = 41.09 CPU) @ 2 +433.68/s (n=100000) hash_array: 14 wallclock secs (11.24 usr + 0.00 sys = 11.24 CPU) @ 8 +896.80/s (n=100000) hash_string: 8 wallclock secs ( 6.26 usr + 0.00 sys = 6.26 CPU) @ 1 +5974.44/s (n=100000) subst: 17 wallclock secs (16.60 usr + 0.00 sys = 16.60 CPU) @ 6 +024.10/s (n=100000) [download]	[reply] [d/l]
Re: Re: using s/// as map? by BrowserUk (Patriarch) on Nov 24, 2003 at 23:46 UTC
Erm. Did you post the wrong version of your code? Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail Hooray! Wanted!	[reply]
Re: Re: Re: using s/// as map? by jryan (Vicar) on Nov 25, 2003 at 05:59 UTC
Re: Re: Re: Re: using s/// as map? by BrowserUk (Patriarch) on Nov 25, 2003 at 06:14 UTC
Some notes below your chosen depth have not been shown here
Re: using s/// as map? by etcshadow (Priest) on Nov 25, 2003 at 06:38 UTC
I tend to love to do it! (And I do not consider it abuse... I once wrote a brainfck interpretter that was just one big `s/.../.../eg`. Now who could call that* abuse?!? ;-D) ------------ :Wq Not an editor command: Wq	[reply] [d/l]
Inline::C by sleepingsquirrel (Chaplain) on Nov 25, 2003 at 06:19 UTC
I'd thought I'd whip up a little test to compare the perl hash based solution to an Inline-C solution. Looks like the C version is about 85x faster... Benchmark: timing 20 iterations of hash_string, inline... hash_string: 37 wallclock secs (37.08 usr + 0.01 sys = 37.09 CPU) @ 0.54/s (n=20) inline: 1 wallclock secs ( 0.44 usr + 0.00 sys = 0.44 CPU) @ 45.45/s (n=20) #!/usr/bin/perl use Inline C; use Benchmark; my $gen = "atgcgc"x500000; #3 million characters $tests{"inline"} = sub { string_inline_c($gen, length($gen)) }; $tests{"hash_string"} = sub { hash_string($gen) }; timethese(20, \%tests); sub hash_string { my ($genome) = @_; my %count; $count{ substr($genome, $_, 2) }++ for (0..length($genome)-2); } __END__ __C__ int string_inline_c(char genome, int len) { int i; int hash[96]; / The hashing function is simply 4(first char - 'a') + second ch +ar - 'a' / /* i.e. the bucket for gg is 4('g'-'a')+'g'-'a' = 30 / /initialize hash buckets which will get used/ /aa/ /ac/ /ag/ /at/ hash[ 0] = hash[ 2] = hash[ 6] = hash[19] = 0; /ca/ /cc/ /cg/ /ct/ hash[ 8] = hash[10] = hash[14] = hash[27] = 0; /ga/ /gc/ /gg/ /gt/ hash[24] = hash[26] = hash[30] = hash[43] = 0; /ta/ /tc/ /tg/ /tt/ hash[76] = hash[78] = hash[82] = hash[95] = 0; for(i=0;i<len-1;i++) { hash[4(genome[i]-'a')+(genome[i+1]-'a')]++; } / returning the proper perl hash is left as an / / exercise for the reader / / see also the Inline-C Cookbook */ return(1); } [download]	[reply] [d/l]
Re: Inline::C -- complete by sleepingsquirrel (Chaplain) on Nov 26, 2003 at 05:09 UTC
Just thought I finish off the code by actually returning the hash back to perl... #!/usr/bin/perl use Inline C; use Benchmark; my $gen = "atgcgc"x500000; #3 million characters my $h_ref; $tests{"inline"} = sub { $h_ref = string_inline_c($gen, length($gen)) +}; $tests{"hash_string"} = sub { hash_string($gen) }; timethese(2, \%tests); sub hash_string { my ($genome) = @_; my %count; $count{ substr($genome, $_, 2) }++ for (0..length($genome)-2); } __END__ __C__ SV* string_inline_c(char genome, int len) { int i; int hash[96]; HV perl_hash=newHV(); /* The hashing function is simply 4(first char - 'a') + second ch +ar - 'a' / /* i.e. the bucket for gg is 4('g'-'a')+'g'-'a' = 30 / /initialize our 'C' hash buckets which will get used/ /aa/ /ac/ /ag/ /at/ hash[ 0] = hash[ 2] = hash[ 6] = hash[19] = 0; /ca/ /cc/ /cg/ /ct/ hash[ 8] = hash[10] = hash[14] = hash[27] = 0; /ga/ /gc/ /gg/ /gt/ hash[24] = hash[26] = hash[30] = hash[43] = 0; /ta/ /tc/ /tg/ /tt/ hash[76] = hash[78] = hash[82] = hash[95] = 0; for(i=0;i<len-1;i++) { hash[4(genome[i]-'a')+(genome[i+1]-'a')]++; } /move our values over from the 'C' hash to the perl hash/ #define h(c,i) (hv_store(perl_hash, (c), sizeof((c))-1, newSViv(hash[( +i)]), 0)) h("aa", 0); h("ac", 2); h("ag", 6); h("at",19); h("ca", 8); h("cc",10); h("cg",14); h("ct",27); h("ga",24); h("gc",26); h("gg",30); h("gt",43); h("ta",76); h("tc",78); h("tg",82); h("tt",95); return newRV_noinc((SV) perl_hash); /return a ref to a hash/ } [download]	[reply] [d/l]