using s/// as map?
by sleepingsquirrel (Chaplain) on Nov 24, 2003 at 18:31 UTC
|
Yeah, I also like the idea of using strings instead of arrays. When I read the problem, my first thought was to try something like...
$genome=~s/(.)(?=(.))/++$count{$1.$2} and undef/eg;
...does anyone else tend to (ab)use the replacement operator as sort of a map which works on strings? | [reply] [d/l] |
|
|
Why do the lookahead? Why not just
$genome =~ s/(..)/++$count{$1} and undef/eg;
------
We are the carpenters and bricklayers of the Information Age.
The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6
... strings and arrays will suffice. As they are easily available as native data types in any sane language, ... - blokhead, speaking on evolutionary algorithms
Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.
| [reply] [d/l] |
|
|
I'm assuming that the original poster wanted the frequency of all pairs, so we don't want the regex engine to consume more than one character per iteration. With...
$genome="acgt";
Then the look ahead version will finish with...
$count{ac} == 1
$count{cg} == 1 #Note this match
$count{gt} == 1
...while the non-lookahead version will have...
$count{ac} == 1
$count{gt} == 1
... i.e. it misses the "cg" case because it jumps through the string two characters at a time.
| [reply] [d/l] [select] |
|
|
I use it a lot in perlgolf, but not in real code since it tends to destroy the string (notice that your code does, though you tried to avoid that I think). In real code using a
while on a regex in scalar context is slightly faster and not so dangerous. When programming an inner loop and trying to be
blazingly fast, you also should avoid things like $1.$2 since
constructing a new value is a relatively expensive operation
in perl. So the regex variant I'd use is:
$count{$1}++ while $genome =~ /(?=(..))./g
But this is still twice as slow as the substr() variant. | [reply] [d/l] |
|
|
This should be a little better.
++$count{$1} while $genome =~ /\G(..)/g;
The PerlMonks advocate for tr///
| [reply] [d/l] |
|
|
|
|
|
|
print unpack '(A2X)*', 'abcdefghijklmnopqrstuvwxyz';
ab bc cd de ef fg gh hi ij jk kl lm mn no op pq qr rs st tu uv vw wx x
+y yz z
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
Hooray!
Wanted!
| [reply] [d/l] |
|
|
No, and for good reason - its very slow compared to a normal solution. That "e" stands for "eval string", which is pretty sluggish, and should only be used for good reason. Compare:
use Benchmark;
use strict;
our $dummy = 0;
our %count;
my $gen = "";
foreach my $l (qw|a t c g|) {
$gen .= "$l$_" for qw|a t c g|;
}
my @gen = split '', $gen;
my %tests;
foreach my $test ( qw|chain subst hash_array hash_string| ) {
no strict refs;
$tests{$test} = sub { *{$test}{CODE}->($gen,\@gen) }
}
timethese(100000, \%tests);
sub chain {
my ($gen,$genome) = @_;
for my $i (0..$#$genome) {
if (($genome->[$i] eq 'a') && ($genome->[$i+1] eq 'a')) { +
++$dummy; }
elsif (($genome->[$i] eq 'a') && ($genome->[$i+1] eq 'g')) { +
++$dummy; }
elsif (($genome->[$i] eq 'a') && ($genome->[$i+1] eq 'c')) { +
++$dummy; }
elsif (($genome->[$i] eq 'a') && ($genome->[$i+1] eq 't')) { +
++$dummy; }
elsif (($genome->[$i] eq 't') && ($genome->[$i+1] eq 'a')) { +
++$dummy; }
elsif (($genome->[$i] eq 't') && ($genome->[$i+1] eq 'g')) { +
++$dummy; }
elsif (($genome->[$i] eq 't') && ($genome->[$i+1] eq 'c')) { +
++$dummy; }
elsif (($genome->[$i] eq 't') && ($genome->[$i+1] eq 't')) { +
++$dummy; }
elsif (($genome->[$i] eq 'c') && ($genome->[$i+1] eq 'a')) { +
++$dummy; }
elsif (($genome->[$i] eq 'c') && ($genome->[$i+1] eq 'g')) { +
++$dummy; }
elsif (($genome->[$i] eq 'c') && ($genome->[$i+1] eq 'c')) { +
++$dummy; }
elsif (($genome->[$i] eq 'c') && ($genome->[$i+1] eq 't')) { +
++$dummy; }
elsif (($genome->[$i] eq 'g') && ($genome->[$i+1] eq 'a')) { +
++$dummy; }
elsif (($genome->[$i] eq 'g') && ($genome->[$i+1] eq 'g')) { +
++$dummy; }
elsif (($genome->[$i] eq 'g') && ($genome->[$i+1] eq 'c')) { +
++$dummy; }
elsif (($genome->[$i] eq 'g') && ($genome->[$i+1] eq 't')) { +
++$dummy; }
}
}
sub subst {
my ($gen,$genome) = @_;
$gen=~s/(.)(?=(.))/++$count{$1.$2} and undef/eg;
}
sub hash_array {
my ($gen,$genome) = @_;
$count { $genome -> [$_] . $genome -> [$_ + 1] } ++ for (0..$#$gen
+ome);
}
sub hash_string {
my ($gen,$genome) = @_;
$count { substr($gen, $_, 2) }++ for (0..length($gen)-2);
}
__DATA__
Benchmark: timing 100000 iterations of chain, hash_array, hash_string,
+ subst...
chain: 64 wallclock secs (41.09 usr + 0.00 sys = 41.09 CPU) @ 2
+433.68/s (n=100000)
hash_array: 14 wallclock secs (11.24 usr + 0.00 sys = 11.24 CPU) @ 8
+896.80/s (n=100000)
hash_string: 8 wallclock secs ( 6.26 usr + 0.00 sys = 6.26 CPU) @ 1
+5974.44/s (n=100000)
subst: 17 wallclock secs (16.60 usr + 0.00 sys = 16.60 CPU) @ 6
+024.10/s (n=100000)
| [reply] [d/l] |
|
|
| [reply] |
|
|
|
|
|
|
|
| [reply] [d/l] |
Inline::C
by sleepingsquirrel (Chaplain) on Nov 25, 2003 at 06:19 UTC
|
I'd thought I'd whip up a little test to compare the perl hash based solution to an Inline-C solution. Looks like the C version is about 85x faster...
Benchmark: timing 20 iterations of hash_string, inline...
hash_string: 37 wallclock secs (37.08 usr + 0.01 sys = 37.09 CPU) @ 0.54/s (n=20)
inline: 1 wallclock secs ( 0.44 usr + 0.00 sys = 0.44 CPU) @ 45.45/s (n=20)
#!/usr/bin/perl
use Inline C;
use Benchmark;
my $gen = "atgcgc"x500000; #3 million characters
$tests{"inline"} = sub { string_inline_c($gen, length($gen)) };
$tests{"hash_string"} = sub { hash_string($gen) };
timethese(20, \%tests);
sub hash_string {
my ($genome) = @_;
my %count;
$count{ substr($genome, $_, 2) }++ for (0..length($genome)-2);
}
__END__
__C__
int string_inline_c(char *genome, int len)
{
int i;
int hash[96];
/* The hashing function is simply 4*(first char - 'a') + second ch
+ar - 'a' */
/* i.e. the bucket for gg is 4*('g'-'a')+'g'-'a' = 30 */
/*initialize hash buckets which will get used*/
/*aa*/ /*ac*/ /*ag*/ /*at*/
hash[ 0] = hash[ 2] = hash[ 6] = hash[19] = 0;
/*ca*/ /*cc*/ /*cg*/ /*ct*/
hash[ 8] = hash[10] = hash[14] = hash[27] = 0;
/*ga*/ /*gc*/ /*gg*/ /*gt*/
hash[24] = hash[26] = hash[30] = hash[43] = 0;
/*ta*/ /*tc*/ /*tg*/ /*tt*/
hash[76] = hash[78] = hash[82] = hash[95] = 0;
for(i=0;i<len-1;i++)
{
hash[4*(genome[i]-'a')+(genome[i+1]-'a')]++;
}
/* returning the proper perl hash is left as an */
/* exercise for the reader */
/* see also the Inline-C Cookbook */
return(1);
}
| [reply] [d/l] |
|
|
Just thought I finish off the code by actually returning the hash back to perl...
#!/usr/bin/perl
use Inline C;
use Benchmark;
my $gen = "atgcgc"x500000; #3 million characters
my $h_ref;
$tests{"inline"} = sub { $h_ref = string_inline_c($gen, length($gen))
+};
$tests{"hash_string"} = sub { hash_string($gen) };
timethese(2, \%tests);
sub hash_string {
my ($genome) = @_;
my %count;
$count{ substr($genome, $_, 2) }++ for (0..length($genome)-2);
}
__END__
__C__
SV* string_inline_c(char *genome, int len)
{
int i;
int hash[96];
HV* perl_hash=newHV();
/* The hashing function is simply 4*(first char - 'a') + second ch
+ar - 'a' */
/* i.e. the bucket for gg is 4*('g'-'a')+'g'-'a' = 30 */
/*initialize our 'C' hash buckets which will get used*/
/*aa*/ /*ac*/ /*ag*/ /*at*/
hash[ 0] = hash[ 2] = hash[ 6] = hash[19] = 0;
/*ca*/ /*cc*/ /*cg*/ /*ct*/
hash[ 8] = hash[10] = hash[14] = hash[27] = 0;
/*ga*/ /*gc*/ /*gg*/ /*gt*/
hash[24] = hash[26] = hash[30] = hash[43] = 0;
/*ta*/ /*tc*/ /*tg*/ /*tt*/
hash[76] = hash[78] = hash[82] = hash[95] = 0;
for(i=0;i<len-1;i++)
{
hash[4*(genome[i]-'a')+(genome[i+1]-'a')]++;
}
/*move our values over from the 'C' hash to the perl hash*/
#define h(c,i) (hv_store(perl_hash, (c), sizeof((c))-1, newSViv(hash[(
+i)]), 0))
h("aa", 0); h("ac", 2); h("ag", 6); h("at",19);
h("ca", 8); h("cc",10); h("cg",14); h("ct",27);
h("ga",24); h("gc",26); h("gg",30); h("gt",43);
h("ta",76); h("tc",78); h("tg",82); h("tt",95);
return newRV_noinc((SV*) perl_hash); /*return a ref to a hash*/
}
| [reply] [d/l] |