Hi,
This is working code that tries to remove strings containing poly-ATCG, i.e. it removes a string when the proportion of A, T, C, or G in it is at or above a threshold. So with the array below it'll return only "ATCGAT".

Although my code below gives the correct result, I feel it's clumsy and slow. Typically it needs to handle arrays of thousands to tens of thousands of strings. I wonder how the venerable monks here would make it more efficient and compact.
#!/usr/bin/perl -w
use strict;
use Data::Dumper;

my @set = qw (AAAAAT ATCGAT TTTTTG GCCCCC GTGGGG);
my $lim = 0.75;
my @sel = remove_poly( \@set, $lim);
print "BEFORE:",scalar(@set),"\n";
print "AFTER:",scalar(@sel),"\n";
#print Dumper \@sel;

sub remove_poly {
    my ($array,$lim) = @_;
    my $len = length $array->[0];
    my @sel_array;
    foreach ( @{$array} ) {
        my $a_count = $_ =~ tr/A//;
        my $t_count = $_ =~ tr/T//;
        my $c_count = $_ =~ tr/C//;
        my $g_count = $_ =~ tr/G//;
        my $a_portion = $a_count/$len;
        my $t_portion = $t_count/$len;
        my $c_portion = $c_count/$len;
        my $g_portion = $g_count/$len;
        #print "$_ $a_portion $t_portion $c_portion $g_portion \n";
        if ( $a_portion < $lim && $t_portion < $lim
          && $c_portion < $lim && $g_portion < $lim ) {
            push @sel_array,$_;
        }
        else {
            print "$_\n";
            next;
        }
    }
    #print Dumper \@sel_array ;
    return @sel_array;
}
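For comparison, here is a minimal sketch of a more compact filter. It is only one possible approach, not any of the replies benchmarked in this thread. The name remove_poly_compact is made up for illustration, and it makes two assumptions that differ from the code above: each string is measured against its own length rather than the length of the first element, and rejected strings are silently dropped instead of printed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical compact variant (name is illustrative only): keep a string
# only if no single base makes up $lim or more of that string's own length.
sub remove_poly_compact {
    my ($array, $lim) = @_;
    return grep {
        my $seq = $_;    # work on a copy; tr/X// with an empty replacement only counts
        my $top = max( $seq =~ tr/A//, $seq =~ tr/T//,
                       $seq =~ tr/C//, $seq =~ tr/G// );
        # Empty strings are dropped, which also avoids dividing by zero.
        length($seq) && $top / length($seq) < $lim;
    } @{$array};
}

my @set = qw(AAAAAT ATCGAT TTTTTG GCCCCC GTGGGG);
my @sel = remove_poly_compact( \@set, 0.75 );
print "BEFORE:", scalar(@set), "\n";
print "AFTER:",  scalar(@sel), "\n";
```

Taking the largest of the four counts means there is one division and one comparison per string instead of four of each, and grep replaces the explicit push loop.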
Update: Benchmark
Thanks so much guys. It's been a great learning experience, as always.
            Rate limbic    ewi   fang    roy   auk1 brs_auk2  jdhed
limbic    4029/s     --   -58%   -64%   -67%   -76%     -77%   -85%
ewi       9693/s   141%     --   -14%   -21%   -42%     -45%   -64%
fang     11261/s   180%    16%     --    -8%   -32%     -37%   -58%
roy      12211/s   203%    26%     8%     --   -27%     -31%   -55%
auk1     16620/s   313%    71%    48%    36%     --      -6%   -38%
brs_auk2 17743/s   340%    83%    58%    45%     7%       --   -34%
jdhed    27022/s   571%   179%   140%   121%    63%      52%     --
Regards,
Edward

In reply to Removing Poly-ATCG from and Array of Strings by monkfan
