monkfan has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
This is a working code which try to remove strings that contain poly-ATCG, i.e. remove the strings when the composition of A or T or C or G is greater than a threshold. So with the array below it'll return only "ATCGAT".

My code below although gives the correct result, somehow I feel it's clumsy and slow. Typically it needs to handle array of of size thousands to tens of thousands. I wonder how would the venerable monks here would make it more efficient and compact.
#!/usr/bin/perl -w use strict; use Data::Dumper; my @set = qw (AAAAAT ATCGAT TTTTTG GCCCCC GTGGGG); my $lim = 0.75; my @sel = remove_poly( \@set, $lim); print "BEFORE:",scalar(@set),"\n"; print "AFTER:",scalar(@sel),"\n"; #print Dumper \@sel; sub remove_poly { my ($array,$lim) = @_; my $len = length $array->[0]; my @sel_array; foreach ( @{$array} ) { my $a_count = $_ =~ tr/A//; my $t_count = $_ =~ tr/T//; my $c_count = $_ =~ tr/C//; my $g_count = $_ =~ tr/G//; my $a_portion = $a_count/$len; my $t_portion = $t_count/$len; my $c_portion = $c_count/$len; my $g_portion = $g_count/$len; #print "$_ $a_portion $t_portion $c_portion $g_portion \n"; if ( $a_portion < $lim && $t_portion < $lim && $c_portion < $li +m && $g_portion < $lim ) { push @sel_array,$_; } else { print "$_\n"; next; } } #print Dumper \@sel_array ; return @sel_array; }
Update: Benchmark
Thanks so much guys. It's been a great learning experience, as always.
Rate limbic ewi fang roy auk1 brs_auk2 jdhed limbic 4029/s -- -58% -64% -67% -76% -77% -85% ewi 9693/s 141% -- -14% -21% -42% -45% -64% fang 11261/s 180% 16% -- -8% -32% -37% -58% roy 12211/s 203% 26% 8% -- -27% -31% -55% auk1 16620/s 313% 71% 48% 36% -- -6% -38% brs_auk2 17743/s 340% 83% 58% 45% 7% -- -34% jdhed 27022/s 571% 179% 140% 121% 63% 52% --
Regards,
Edward

Replies are listed 'Best First'.
Re: Removing Poly-ATCG from and Array of Strings
by jdhedden (Deacon) on Jun 09, 2005 at 12:59 UTC
    1. Remove the 'portion' calculations by using the integer product of 'lim' and 'length' in the comparisons. (Eliminates floating point calcuations and comparisons in the loop!)
    2. Put the 'tr's in the comparison so that you can short-circuit having to do all the translations.
    3. Use 'grep'.
    4. Return an array reference from the subroutine.
    #!/usr/bin/perl use strict; use warnings; MAIN: { my @set = qw (AAAAAT ATCGAT TTTTTG GCCCCC GTGGGG); my $lim = 0.75; print("BEFORE: ", scalar(@set), "\n"); my $results = remove_poly(\@set, $lim); print("AFTER: ", scalar(@$results), "\n"); #print(join("\n", @$results), "\n"); } exit(0); sub remove_poly { my $array = $_[0]; my $lim = $_[1]; $lim = int($lim * length($$array[0])); my @results = grep { (($_ =~ tr/A//) <= $lim) && (($_ =~ tr/T//) <= $lim) && (($_ =~ tr/C//) <= $lim) && (($_ =~ tr/G//) <= $lim) } @$array; return (\@results); } # EOF
      jdhedden,
      It is only fair to calculate $lim once if there is some guarantee that all strings will be the same length. It is possible that is what was intended by ewijaya's example since I am the only one that dynamically calculated the limit each iteration.

      Cheers - L~R

Re: Removing Poly-ATCG from and Array of Strings
by Roy Johnson (Monsignor) on Jun 09, 2005 at 13:24 UTC
    Normally I'm big on using tr///, but in this case, the symmetry of cases suggests using a regex to filter out strings where any character is repeated too many times:
    sub remove_poly { my ($array,$lim) = @_; grep { my $max_num = int($lim * length())-1; !/(.)(?:.*?\1){$max_num}/ } @$array; }
    update: non-greedy for less backtracking; another update to include length checking for each element.

    Caution: Contents may have been coded under pressure.
Re: Removing Poly-ATCG from and Array of Strings
by aukjan (Friar) on Jun 09, 2005 at 12:32 UTC

    The following should does the same as yours... I don't know if it is faster... You could test this with the benchmark module:
    use strict; use warnings; my @set = qw (AAAAAT ATCGAT TTTTTG GCCCCC GTGGGG); my $lim = 0.75; my @sel = remove_poly( \@set, $lim); print "BEFORE:",scalar(@set),"\n"; print "AFTER:",scalar(@sel),"\n"; sub remove_poly { my ($array,$lim) = @_; my $len = length $array->[0]; my @sel_array; @sel_array = grep { ( ((tr/A//)/$len < $lim) && ((tr/T//)/$len < $lim) && ((tr/C//)/$len < $lim) && ((tr/G//)/$len < $lim) ) and $_ } @$array +; return @sel_array; }
    Now since this is a lot smaller, you could even take it out of the sub..

    .:| If it can't be fixed .. Don't break it |:.

      Another way is to use map on the original array and set all the entries which you don't want to '', and filter those out lateron..
      my @set = qw (AAAAAT ATCGAT TTTTTG GCCCCC GTGGGG); my $lim = 0.75; my $len = length $set[0]; print "BEFORE:",scalar(@set),"\n"; map { ( ((tr/A//)/$len < $lim) && ((tr/T//)/$len < $lim) && ((tr/C//)/$len < $lim) && ((tr/G//)/$len < $lim) ) or $_ = '' } @set; print "AFTER:",scalar(@set),"\n";
      Now the @set contains:
      $VAR1 = ''; $VAR2 = 'ATCGAT'; $VAR3 = ''; $VAR4 = ''; $VAR5 = '';

      .:| If it can't be fixed .. Don't break it |:.

Re: Removing Poly-ATCG from and Array of Strings
by Fang (Pilgrim) on Jun 09, 2005 at 12:34 UTC

    I believe you could easily use grep in your subroutine, as in:

    sub remove_poly { my ($array,$lim) = @_; my $len = length $array->[0]; my @sel_array = grep { my $a_count = $_ =~ tr/A//; my $t_count = $_ =~ tr/T//; my $c_count = $_ =~ tr/C//; my $g_count = $_ =~ tr/G//; my $a_portion = $a_count/$len; my $t_portion = $t_count/$len; my $c_portion = $c_count/$len; my $g_portion = $g_count/$len; $_ if ( $a_portion < $lim && $t_portion < $lim && $c_portion < +$lim && $g_portion < $lim ) } @$array; return @sel_array; }

    The initial array being big, you could also return a reference to the resulting set, instead of the whole array.

Re: Removing Poly-ATCG from and Array of Strings
by Limbic~Region (Chancellor) on Jun 09, 2005 at 12:38 UTC
    ewijaya,
    Ok - I had an idea in my head that didn't pan out the way I expected. I would be interested in seeing what I ended up with benchmarked with some real sample data though.
    sub remove_poly { my ($list, $threshold) = @_; return map { my $lim = $threshold * length $list->[$_]; my (%cnt, $excede); for ( unpack 'U*', $list->[$_] ) { $excede = 1, last if ++$cnt{$_} > $lim; } $excede ? () : $list->[$_]; } 0 .. $#$list; }

    Cheers - L~R

Re: Removing Poly-ATCG from and Array of Strings
by mrborisguy (Hermit) on Jun 09, 2005 at 12:42 UTC

    I would think you could use some sort of grep. Here's kindof what I was thinking

    my @set = qw (AAAAAT ATCGAT TTTTTG GCCCCC GTGGGG); my $lim = 0.75; my @newset = grep { my $a = $_ =~ tr/A//; my $t = $_ =~ tr/T//; my $c = $_ =~ tr/C//; my $g = $_ =~ tr/G//; $a / $len < $lim and $c / $len < $lim and $g / $len < $lim and $t / $len < $lim; } @set;

    Update:
    Wow... I just got beat to the punch big time! Three people posted while I was writing my response!

        -Bryan