markdavis87 has asked for the wisdom of the Perl Monks concerning the following question:

Alright, so I have a question for all you Perl gurus out there... I have a program that yields a set of data, in a certain format. The program is written in Perl, and I would like to append some code to the program that will sort the data it produces a certain way. Here is what the original output looks like:

ALPHA:D 20 letters ABCCEDFFGAACDDDEEEFG ALPHA:D 20 letters ABCCEDFFGAACDDDEEEFG ALPHA:D 20 letters ABCCEDFGGAACDDDEEEFG ALPHA:D 20 letters ABCCEDFGGAACDDDEEEFG ALPHA:D 20 letters ABCCEDFGGAACDDDEEEFG ALPHA:D 20 letters ABCCEDFGGAACDDDEEEFG ALPHA:D 20 letters ABCCEDFFGAACDDEEEEFG ALPHA:D 20 letters ABCCEDFFGAACDDDEEEFG ALPHA:D 20 letters ABCCEDFFGAACDDEEEEFG ALPHA:E 24 letters ABCCEDFFGAACDDDEEEFGAGAD ALPHA:E 24 letters ABCCEDFFGAACDDDEEEFGAGAD ALPHA:E 24 letters ABCCEDFFGAACDDDEEEFGAGAD ALPHA:E 24 letters ABCCEDFFGAACDDDEEEFGAGAD ALPHA:E 24 letters ABCCEDFFGAACDDDEEEFGAGAD ALPHA:E 24 letters ABCCEDFFGAACDDDEEEFGAGAE ALPHA:E 24 letters ABCCEDFFGAACDDDEEEFGAGAE ALPHA:E 24 letters ABCCEDFFGAACDDDEEEFGAGAE ALPHA:E 24 letters ABCCEDFFGAACDDDEEEFGAGAE ALPHA:E 24 letters ABCCEDFFGAACDDDEEEFGAGAE ALPHA:E 24 letters ABCCEDFFGAACDDDEEEFGAGAE ALPHA:E 24 letters ABCCEDFFGAACDDDEEEFGAGAE

Each field is tab-delimited (except for the space between "20" and "letters", for example, which is just a space). I'm looking for a few lines of code that will sort the data so that it looks like this:

ALPHA:D 20 letters ABCCEDFGGAACDDDEEEFG 4
ALPHA:D 20 letters ABCCEDFFGAACDDDEEEFG 3
ALPHA:D 20 letters ABCCEDFFGAACDDEEEEFG 2
ALPHA:E 24 letters ABCCEDFFGAACDDDEEEFGAGAE 7
ALPHA:E 24 letters ABCCEDFFGAACDDDEEEFGAGAD 5

As you can see, what it's doing is looking at each value (e.g., "D" or "E") listed next to the initial name ("ALPHA", in this case), and counting the number of unique arrangements of letters for that specific value. Please note that while a line may contain the same number of letters (e.g., "20 letters"), it may not have the same arrangement. Certain letters might be different in the string. For value "D", the arrangement "ABCCEDFFGAACDDDEEEFG" appears 3 times, but not necessarily in order. The arrangement "ABCCEDFGGAACDDDEEEFG" appears 4 times. The code should get the counts, order them with the highest counts first, then move to the next value. I am assuming that there are some really basic string manipulation commands that can do this quite easily, but I am by no means an expert in Perl, so I have no idea how this would work. Could any of you help me out here? I would greatly appreciate it!

Re: String sorting in Perl
by Laurent_R (Canon) on Jun 04, 2014 at 17:34 UTC

    Hi markdavis87,

    This is not so much sorting as counting the number of distinct entries and then sorting by the counts. The easiest way is to store your lines in a hash, with the full line as the key and the count as the value. Assuming your lines are stored in the @data array, you could do this:

    my %count_hash;
    for my $line (@data) {
        $count_hash{$line}++;
    }

    Then you only need to sort on line size and count. And you're done.

    Edit 17:39: To do the sort, something like this should probably work (untested):

    my @sorted_data = sort { length $a <=> length $b || $count_hash{b} <=> $count_hash{$a} } keys %count_hash;

    This is supposed to sort the hash content in ascending order of line lengths and descending order of counts.

    Edit 2, 18:30: small typo in the sort statement above. It should be:

    my @sorted_data = sort { length $a <=> length $b || $count_hash{$b} <=> $count_hash{$a} } keys %count_hash;

    (I had $count_hash{b} instead of $count_hash{$b}.)
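
    A minimal sketch of the count-then-sort idea on a handful of made-up records. The sample strings and the helper name sort_by_length_then_count are illustrative assumptions, not part of the original program:

```perl
use strict;
use warnings;

# Hypothetical helper: count identical lines, then order the distinct lines
# by ascending length and, within equal lengths, by descending count.
sub sort_by_length_then_count {
    my @lines = @_;
    my %count_hash;
    $count_hash{$_}++ for @lines;
    my @sorted = sort {
        length($a) <=> length($b)
            || $count_hash{$b} <=> $count_hash{$a}
    } keys %count_hash;
    return (\@sorted, \%count_hash);
}

# Made-up sample data, much shorter than the real records.
my @data = ("AAB", "AAC", "AAC", "AAAAD", "AAAAE", "AAAAE", "AAAAE");
my ($sorted, $counts) = sort_by_length_then_count(@data);

# Print each distinct line followed by how often it occurred.
print "$_\t$counts->{$_}\n" for @$sorted;
```

    The second comparison only runs when the lengths are equal, because || short-circuits on a non-zero result from the first <=>.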

      Laurent_R,
      And you're done.

      It appears the OP is interested in getting the values out in descending order (sorted).

      for my $line (sort {$count_hash{$b} <=> $count_hash{$a}} keys %count_hash) {
          print "$line $count_hash{$line}\n";
      }

      Cheers - L~R

        Yes, Limbic~Region, you're right ++. That's why I updated the post immediately after I posted it, but you saw it before I posted the change. Having said that, my understanding of how the sort should be carried out is not exactly the same as yours (I explained it in my update).

      That last anonymous post was me... Sorry! I figured I should give you the full code I'm using to test this out. Here it is:

      #!/usr/bin/perl -w
      use strict;

      my $file = "my path to the file name";
      open (FH, "< $file") or die "Can't open $file for read: $!";
      my @data = <FH>;
      close FH or die "Cannot close $file: $!";

      my %count_hash;
      for my $line (@data) { $count_hash{$line} ++; }

      my @sorted_data = sort { length $a <=> length $b || $count_hash{b} <=> $count_hash{$a} } keys %count_hash;

      print @sorted_data; # see if it worked

      Ah! I got it.

      my @sorted_data = sort { length $a <=> length $b || $count_hash{b} <=> $count_hash{$a} } keys %count_hash;

      should be

      my @sorted_data = sort { length $a <=> length $b || $count_hash{$b} <=> $count_hash{$a} } keys %count_hash;

      We were just missing a "$" before the "b"...

        Yes, I was going to give the answer, but you found out yourself. Sorry for the typo. I'll update my post to get it right.

      When I try to run your code, I get the following output:

      Use of uninitialized value in numeric comparison (<=>) at ./sorttest.pl line 12.
      Use of uninitialized value in numeric comparison (<=>) at ./sorttest.pl line 12.
      Use of uninitialized value in numeric comparison (<=>) at ./sorttest.pl line 12.
      Use of uninitialized value in numeric comparison (<=>) at ./sorttest.pl line 12.
      ALPHA:D 20 letters ABCCEDFFGAACDDEEEEFG
      ALPHA:D 20 letters ABCCEDFGGAACDDDEEEFG
      ALPHA:D 20 letters ABCCEDFFGAACDDDEEEFG
      ALPHA:E 24 letters ABCCEDFFGAACDDDEEEFGAGAD
      ALPHA:E 24 letters ABCCEDFFGAACDDDEEEFGAGAE

      How can I resolve this, and where are the count values for these?
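
      The warnings come from $count_hash{b}: the bareword b is read as the literal key "b", which is never set, so the comparison sees undef whenever two lines have the same length. The counts are missing because only the hash keys are printed. Here is a hedged sketch that addresses both, assuming tab-separated fields with the name (for example ALPHA:D) first. It groups on that field instead of on line length, which is a substitution for the length-based sort above, and the helper name count_and_sort and the sample records are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: takes raw lines and returns "line<TAB>count" strings,
# grouped by the first tab-separated field, highest count first per group.
sub count_and_sort {
    my @lines = @_;
    my %count_hash;
    for my $line (@lines) {
        chomp $line;                 # keep the newline out of the hash key
        next unless length $line;    # skip blank lines
        $count_hash{$line}++;
    }

    # Look up the name field once per distinct line.
    my %name_of;
    $name_of{$_} = (split /\t/, $_)[0] for keys %count_hash;

    my @sorted = sort {
        $name_of{$a} cmp $name_of{$b}                 # group by name
            || $count_hash{$b} <=> $count_hash{$a}    # highest count first
            || $a cmp $b                              # deterministic tie-break
    } keys %count_hash;

    return map { $_ . "\t" . $count_hash{$_} } @sorted;
}

# Made-up sample in the same shape as the original data.
my @data = (
    "ALPHA:D\t4 letters\tABCD\n",
    "ALPHA:D\t4 letters\tABCE\n",
    "ALPHA:D\t4 letters\tABCE\n",
    "ALPHA:E\t6 letters\tABCDEF\n",
);
print "$_\n" for count_and_sort(@data);
```

      In the original script, the same effect would need the lines read from the file to be chomped before counting, and the print to include $count_hash{$line} next to each line.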

Re: String sorting in Perl
by perlfan (Parson) on Jun 05, 2014 at 13:09 UTC
    Is this a bioinformatics application? Have you checked out anything available via BioPerl?