La Familia has asked for the wisdom of the Perl Monks concerning the following question:

I'm a newbie to Perl, having started about a month ago by working out of a book called "Perl Programming for Biologists". I'm working with subroutines and trying to write a program that orders nucleotide sequences by GC content using the sort function.

So far, this is what I have:
@sequences = (ATATGACTTG, GGGGATCCAC, ATACATATAC, AGGCTACGCT, GAGGCCGCGC);
@list = sort GC @sequences;

sub GC{
    $count = 0; #keep count of g and c per sequence
    $index = 0; #cycle through each item in list
    for ($count = 0; { #Not sure, I've tried many things.
        if(g == $sequences{$index}) {$total = $count + 1}
        if(c == $sequences{$index}) {$total = $count + 1}
        $index++;
        return($a cmp $b);
    }
}
print join("\n", @list), "\n";

If I can get guidance, criticism, or help on this and what I'm doing wrong, that will be very much appreciated. Thanks all. Perl is my first computer language, so I plan to stick with it.

Replies are listed 'Best First'.
Re: Help Manipulating Sort with Subroutines
by ikegami (Patriarch) on Dec 23, 2010 at 06:36 UTC

    If I understand correctly, you want to sort by increasing number of "G"s and "C"s.

    How can you compare letter counts if you only have one count? The callback is called many times, each time to compare two of the items being sorted. To sort based on the number of "G"s and "C"s in the items, you'll have to count the "G"s and "C"s in both items and compare those counts.

    my @sorted = sort {
        my $count_a = $a =~ tr/GC//;
        my $count_b = $b =~ tr/GC//;
        $count_a <=> $count_b
    } @sequences;
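For anyone who wants to run this against the sequences from the question, here is a minimal sketch of a complete script. It assumes the sequences are quoted (the qw() list is an addition here; the original post used bare words) and that strict and warnings are enabled.

```perl
use strict;
use warnings;

# Sample data from the question, quoted with qw() (an assumption:
# the original post left the sequences unquoted).
my @sequences = qw(ATATGACTTG GGGGATCCAC ATACATATAC AGGCTACGCT GAGGCCGCGC);

# The comparison block receives two items in $a and $b.
# tr/GC// with an empty replacement list only counts; it does not modify.
my @sorted = sort {
    my $count_a = $a =~ tr/GC//;
    my $count_b = $b =~ tr/GC//;
    $count_a <=> $count_b          # numeric comparison of the two counts
} @sequences;

print join("\n", @sorted), "\n";
# Prints, one per line (GC counts 2, 3, 6, 7, 9):
# ATACATATAC
# ATATGACTTG
# AGGCTACGCT
# GGGGATCCAC
# GAGGCCGCGC
```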

    That should do in almost all circumstances. If you're dealing with massive amounts of data, the following optimisation might help:

    my @sorted = map substr($_, 4), sort map pack('N', tr/GC//) . $_, @sequences;
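The one-liner above packs three steps together, which can be hard to read at first. The sketch below unrolls them; the intermediate names @keyed and @sorted_keyed are illustrative only and are not in the original reply. This prefix-and-plain-sort idiom is commonly known as the Guttman-Rosler Transform.

```perl
use strict;
use warnings;

my @sequences = qw(ATATGACTTG GGGGATCCAC ATACATATAC AGGCTACGCT GAGGCCGCGC);

# Step 1: prefix each sequence with its GC count, packed as a 4-byte
# big-endian unsigned integer ('N'). Big-endian bytes compare as strings
# in the same order as the numbers they encode.
my @keyed = map { pack('N', tr/GC//) . $_ } @sequences;

# Step 2: a plain sort with no comparison block, so Perl uses its
# built-in string comparison and never calls back into Perl code.
my @sorted_keyed = sort @keyed;

# Step 3: strip the 4-byte prefix to recover the original sequences.
my @sorted = map { substr($_, 4) } @sorted_keyed;

print join("\n", @sorted), "\n";
```

One difference from the comparison-block version: sequences with equal GC counts end up ordered by the sequence string itself, because the string follows the packed count in the sort key.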
Re: Help Manipulating Sort with Subroutines
by Utilitarian (Vicar) on Dec 23, 2010 at 08:07 UTC
    A few hopefully helpful observations.
    • g != G: you are checking for a character that never appears in your list.
    • You return a plain string comparison of $a and $b; you should be comparing their combined G and C counts.
    • The "spaceship operator" <=> is used for numerical comparison; cmp compares strings.
    • And finally, a hint: tr/// returns the number of characters it matched, and a regular expression with /g returns all of its matches in list context, which you can then count.
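The observations above can be checked with a short sketch. The variable names ($seq, $by_tr, $by_match, $lower) are illustrative, and the sample sequence is taken from the question.

```perl
use strict;
use warnings;

my $seq = 'GGGGATCCAC';

# tr/// returns the number of characters it matched.
my $by_tr = ($seq =~ tr/GC//);

# A global match in list context returns every match; assigning that
# list to () and then to a scalar yields the number of matches.
my $by_match = () = $seq =~ /[GC]/g;

# Matching is case-sensitive, so lowercase g and c are never found here.
my $lower = ($seq =~ tr/gc//);

print "tr: $by_tr, match: $by_match, lowercase: $lower\n";
# prints "tr: 7, match: 7, lowercase: 0"

# <=> compares numbers, cmp compares strings.
my @numeric = sort { $a <=> $b } (10, 9, 2);   # 2, 9, 10
my @stringy = sort { $a cmp $b } (10, 9, 2);   # 10, 2, 9
print "@numeric | @stringy\n";
```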
    print "Good ",qw(night morning afternoon evening)[(localtime)[2]/6]," fellow monks."
      Thank you both for the help!