danielfortin86 has asked for the wisdom of the Perl Monks concerning the following question:

I was wondering if there was a faster way of recording the number of instances of a letter at a given position? Basically, the code below does what I'd like to do, but I'd like it to be faster since it will be run quite often. Any suggestions?

@seq = split(//,$string);
$i = 0;
foreach$letter(@seq){
    switch($letter){
        case "A" { $A[$i]++; }
        case "C" { $C[$i]++; }
        case "G" { $G[$i]++; }
        case "T" { $T[$i]++; }
        else     { $N[$i]++; }
    }
    $i++;
}

Replies are listed 'Best First'.
Re: Iterating over string
by almut (Canon) on Apr 06, 2010 at 08:04 UTC

    This is only a marginal speed improvement (5-20%, compared to the respective given/when rewrite of your original version). It is more of a memory usage optimization: it avoids splitting the string into an extra array of letters, which consumes a lot more memory than the original string. That is only relevant if the strings could be huge, though, in which case @A, @C etc. would be huge as well, of course...

    use 5.010;

    for (my $i=0; $i<length($string); $i++) {
        my $letter = substr($string, $i, 1);
        given ($letter) {
            when ("A") { $A[$i]++; }
            when ("C") { $C[$i]++; }
            when ("G") { $G[$i]++; }
            when ("T") { $T[$i]++; }
            default    { $N[$i]++; }
        }
    }

    P.S.: Perl 5.10's given/when is more than ten times as fast as the old (and generally frowned upon) switch/case.

    Update:

    #!/usr/local/bin/perl
    use 5.010;
    use Switch;
    use Benchmark 'cmpthese';

    $string = "AGCTAGCTAGAAGTCGGTGACTGfoobar";

    sub orig_switchcase {
        my @seq = split(//,$string);
        my $i = 0;
        for my $letter (@seq) {
            switch ($letter) {
                case "A" { $A0[$i]++; }
                case "C" { $C0[$i]++; }
                case "G" { $G0[$i]++; }
                case "T" { $T0[$i]++; }
                else     { $N0[$i]++; }
            }
            $i++;
        }
    }

    sub orig_givenwhen {
        my @seq = split(//,$string);
        my $i = 0;
        for my $letter (@seq) {
            given ($letter) {
                when ("A") { $A1[$i]++; }
                when ("C") { $C1[$i]++; }
                when ("G") { $G1[$i]++; }
                when ("T") { $T1[$i]++; }
                default    { $N1[$i]++; }
            }
            $i++;
        }
    }

    sub substr_givenwhen {
        for (my $i=0; $i<length($string); $i++) {
            my $letter = substr($string, $i, 1);
            given ($letter) {
                when ("A") { $A2[$i]++; }
                when ("C") { $C2[$i]++; }
                when ("G") { $G2[$i]++; }
                when ("T") { $T2[$i]++; }
                default    { $N2[$i]++; }
            }
        }
    }

    sub BUK {
        my $string2 = $string;
        $string2 =~ tr[ACGT][N]c;
        $c{ substr $string2, $_, 1 }[ $_ ]++ for 0 .. length( $string2 )-1;
    }

    cmpthese(-1, {
            'orig_s/c'   => \&orig_switchcase,
            'orig_g/w'   => \&orig_givenwhen,
            'substr_g/w' => \&substr_givenwhen,
            BUK          => \&BUK,
        },
    );

    __END__
                  Rate   orig_s/c   orig_g/w substr_g/w        BUK
    orig_s/c    2112/s         --       -91%       -93%       -97%
    orig_g/w   24660/s      1067%         --       -19%       -68%
    substr_g/w 30632/s      1350%        24%         --       -60%
    BUK        77422/s      3565%       214%       153%         --
Re: Iterating over string
by BrowserUk (Patriarch) on Apr 06, 2010 at 08:06 UTC

    This gathers the same information (i.e. a count of each character at each position in the string), and runs roughly 40 times faster.

    (Faster still if you don't need the tr///).

    my %c;
    $string =~ tr[ACGT][N]c;
    $c{ substr $string, $_, 1 }[ $_ ]++ for 0 .. length( $string )-1;
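
    A minimal sketch of how the same idea might extend to several strings, and how the resulting hash of arrays can be read back. The helper name count_positions and the three sample strings are assumptions made for illustration, not part of the post above. Unlike the snippet above, it works on copies, so the caller's strings are not altered by the tr.

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical helper: adds per-position letter counts to the hash of arrays in $c.
sub count_positions {
    my ($c, @strings) = @_;         # @strings holds copies, so the tr below is safe
    for my $s (@strings) {
        $s =~ tr[ACGT][N]c;         # map everything that is not A, C, G or T to N
        $c->{ substr $s, $_, 1 }[$_]++ for 0 .. length($s) - 1;
    }
    return $c;
}

my %c;
count_positions(\%c, 'ACGT', 'AAGX', 'ACCT');   # sample input, assumed

# One row per letter; positions where a letter never occurred are undef, shown as 0.
for my $letter (sort keys %c) {
    say "$letter: ", join(' ', map { $_ // 0 } @{ $c{$letter} });
}
```

    Note that the per-letter arrays only extend as far as the last position at which that letter was seen, so rows can have different lengths.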

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Iterating over string
by Corion (Patriarch) on Apr 06, 2010 at 08:09 UTC

    If you just want to count characters, then tr or //g are likely the fastest. tr doesn't nicely interpolate, so I'm just looking at //g, and put it in list context:

    local $_ = $string;
    my $total;
    for my $letter (qw(A C G T)) {
        my $count =()= /$letter/g;
        print "$letter: $count\n";
        $total += $count;
    }
    my $other = length($_) - $total;
    print "Other: $other\n";

    Depending on the length of $string a direct assignment might be faster, but you should Benchmark that with your data. Using code in a regex replacement causes various things to get set up and torn down again, so the overhead of passing over the string four times needs to be weighed against the overhead of a subroutine/block call in the regex:

    local $_ = $string;
    local %count;
    s/([ACGT])/$count{$1}++;$1/ge;
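
    For comparison, the tr route mentioned at the start of this reply can be sketched by writing out one tr per letter, since tr builds its character table at compile time and cannot interpolate $letter. This is only an illustration; the sample $string is assumed here. Like the //g version, it yields totals per letter rather than counts per position.

```perl
use strict;
use warnings;

my $string = 'AGCTAGCTAGAAGTCGGTGACTGfoobar';   # sample input, assumed for illustration

# tr with an empty replacement list counts matches without changing $string.
my %count = (
    A => ($string =~ tr/A//),
    C => ($string =~ tr/C//),
    G => ($string =~ tr/G//),
    T => ($string =~ tr/T//),
);

# Everything outside ACGT: the c flag complements the search list.
$count{Other} = ($string =~ tr/ACGT//c);

print "$_: $count{$_}\n" for qw(A C G T Other);
```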
Re: Iterating over string
by cdarke (Prior) on Apr 06, 2010 at 07:56 UTC
    Not sure if it is quicker, it will depend on volumes, but you could Benchmark using a hash and RE:
    my $string = 'ALL CHANGE, THOSE GOING TO ASCOT';
    my %hash;
    while ($string =~ /([AGCT])/g) {
        $hash{$1}++
    }
    local $, = ' ';
    print %hash,"\n";


    By the way, don't use switch/case, it is buggy and deprecated. Use given/when instead.
Re: Iterating over string
by biohisham (Priest) on Apr 06, 2010 at 08:15 UTC
    The thing to look out for is getting rid of the newline characters, since sequence data can be spread across a large number of lines. Make it a habit to localize the scope of your variables and to turn strictures on; that is a good programming habit. Also, check perltidy, because your code has to be readable in order to be maintainable later. Frankly, your coding style seems good with its indentation; however, you left out the spaces at "foreach$letter(@seq)"...

    use strict;
    use warnings;

    my %counts;
    my @seq;
    local $/='';
    while(<DATA>){
        chop;
        s/\n//g;    # remove every embedded newline, not just the first one
        @seq = split '';
    }
    foreach my $element (@seq){
        # count each DNA base, or group foreign letters under 'N'
        $element =~ /[agct]/i ? $counts{$element}++ : $counts{'N'}++;
    }
    use Data::Dumper;
    print Dumper(\%counts);
    __DATA__
    agcttgtc
    agtccxffhhhh

    One final suggestion/exercise, check Benchmark to compare speeds of segments of code in order to decide which one of these proposed approaches is fast(er|est) and pick it up. Though I believe, and would like to be corrected if otherwise, that switch statements are computationally expensive...


    Excellence is an Endeavor of Persistence. Chance Favors a Prepared Mind.
Re: Iterating over string
by Jenda (Abbot) on Apr 14, 2010 at 12:38 UTC

    Looks like most responders failed to notice that you do not need the total count of A/C/G/T in that string, but rather that you're going to process several strings and count the number of A/C/G/Ts as the first character, second character, ...

    Do drop the four separate variables and use a hash of arrays and make sure you declare your variables. That way you do not have to use the switch:

    my %counts;
    my @seq = split(//,$string);
    my $i = 0;
    foreach my $letter (@seq){
        $counts{$letter}[$i]++;
        $i++;
    }

    Then, as others suggested, you may want to get rid of the split. You may either use substr() as suggested or something like this:

    my $i=0;
    $counts{$1}[$i++]++ while $string =~ /(.)/g;

    Test for yourself what's quickest.
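
    One possible way to run such a test, sketched with the core Benchmark module. The sample $string and the three sub names (via_split, via_substr, via_regex) are assumptions made for illustration; timings will depend on your Perl version and on the length of your real data. Note that the regex variant skips newlines, because . does not match them by default.

```perl
use strict;
use warnings;
use Benchmark 'cmpthese';

my $string = 'AGCTAGCTAGAAGTCGGTGACTG' x 10;   # sample data, assumed for illustration

# Each variant adds per-position letter counts to the hash of arrays in $counts.
sub via_split {
    my ($counts, $s) = @_;
    my $i = 0;
    $counts->{$_}[$i++]++ for split //, $s;
}

sub via_substr {
    my ($counts, $s) = @_;
    $counts->{ substr $s, $_, 1 }[$_]++ for 0 .. length($s) - 1;
}

sub via_regex {
    my ($counts, $s) = @_;
    my $i = 0;
    $counts->{$1}[$i++]++ while $s =~ /(.)/g;
}

# Run each variant for at least one CPU second and print a comparison table.
my (%c1, %c2, %c3);
cmpthese(-1, {
    'split'  => sub { via_split(\%c1, $string) },
    'substr' => sub { via_substr(\%c2, $string) },
    'regex'  => sub { via_regex(\%c3, $string) },
});
```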

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.