String sorting in Perl

markdavis87 has asked for the wisdom of the Perl Monks concerning the following question:

Alright, so I have a question for all you Perl gurus out there... I have a program that yields a set of data, in a certain format. The program is written in Perl, and I would like to append some code to the program that will sort the data it produces a certain way. Here is what the original output looks like:

ALPHA:D    20 letters    ABCCEDFFGAACDDDEEEFG
ALPHA:D    20 letters    ABCCEDFFGAACDDDEEEFG
ALPHA:D    20 letters    ABCCEDFGGAACDDDEEEFG
ALPHA:D    20 letters    ABCCEDFGGAACDDDEEEFG
ALPHA:D    20 letters    ABCCEDFGGAACDDDEEEFG
ALPHA:D    20 letters    ABCCEDFGGAACDDDEEEFG
ALPHA:D    20 letters    ABCCEDFFGAACDDEEEEFG
ALPHA:D    20 letters    ABCCEDFFGAACDDDEEEFG
ALPHA:D    20 letters    ABCCEDFFGAACDDEEEEFG
ALPHA:E    24 letters    ABCCEDFFGAACDDDEEEFGAGAD
ALPHA:E    24 letters    ABCCEDFFGAACDDDEEEFGAGAD
ALPHA:E    24 letters    ABCCEDFFGAACDDDEEEFGAGAD
ALPHA:E    24 letters    ABCCEDFFGAACDDDEEEFGAGAD
ALPHA:E    24 letters    ABCCEDFFGAACDDDEEEFGAGAD
ALPHA:E    24 letters    ABCCEDFFGAACDDDEEEFGAGAE
ALPHA:E    24 letters    ABCCEDFFGAACDDDEEEFGAGAE
ALPHA:E    24 letters    ABCCEDFFGAACDDDEEEFGAGAE
ALPHA:E    24 letters    ABCCEDFFGAACDDDEEEFGAGAE
ALPHA:E    24 letters    ABCCEDFFGAACDDDEEEFGAGAE
ALPHA:E    24 letters    ABCCEDFFGAACDDDEEEFGAGAE
ALPHA:E    24 letters    ABCCEDFFGAACDDDEEEFGAGAE

Each field is tab-delimited (except for the space between "20" and "letters", for example, which is just a space). I'm looking for a few lines of code that will sort the data so that it looks like this:

ALPHA:D    20 letters    ABCCEDFGGAACDDDEEEFG    4
ALPHA:D    20 letters    ABCCEDFFGAACDDDEEEFG    3
ALPHA:D    20 letters    ABCCEDFFGAACDDEEEEFG    2
ALPHA:E    24 letters    ABCCEDFFGAACDDDEEEFGAGAE    7
ALPHA:E    24 letters    ABCCEDFFGAACDDDEEEFGAGAD    5

As you can see, what it's doing is looking at each value (e.g., "D" or "E") listed next to the initial name ("ALPHA", in this case), and counting the number of unique arrangements of letters for that specific value. Please note that while a line may contain the same number of letters (e.g., "20 letters"), it may not have the same arrangement. Certain letters might be different in the string. For value "D", the arrangement "ABCCEDFFGAACDDDEEEFG" appears 3 times, but not necessarily in order. The arrangement "ABCCEDFGGAACDDDEEEFG" appears 4 times. The code should get the counts, order them with the highest counts first, then move to the next value. I am assuming that there are some really basic string manipulation commands that can do this quite easily, but I am by no means an expert in Perl, so I have no idea how this would work. Could any of you help me out here? I would greatly appreciate it!

Replies are listed 'Best First'.
Re: String sorting in Perl by Laurent_R (Canon) on Jun 04, 2014 at 17:34 UTC
Hi markdavis87, this is not really sorting so much as counting the number of distinct entries and then sorting by the counts. The easiest way is to store your lines in a hash, with the full line as the key and the count as the value. Assuming your lines are stored in the @data array, you could do this: `my %count_hash; for my $line (@data) { $count_hash{$line}++; }` Then you only need to sort on line size and count. And you're done.
Edit 17:39: To do the sort, something like this should probably work (untested): `my @sorted_data = sort { length $a <=> length $b || $count_hash{b} <=> $count_hash{$a} } keys %count_hash;` This is supposed to sort the hash content in ascending order of line length and descending order of count.
Edit 2, 18:30: small typo in the sort statement above. It should be: `my @sorted_data = sort { length $a <=> length $b || $count_hash{$b} <=> $count_hash{$a} } keys %count_hash;` (I had `$count_hash{b}` instead of `$count_hash{$b}`.)
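The counting and sorting steps can be put together roughly as follows. This is a minimal sketch, not the code from the thread: it assumes the lines have already been chomped, and it groups on the first tab-separated field (ALPHA:D, ALPHA:E) instead of on line length, which is one reading of "then move to the next value" in the question. The names `count_and_sort` and `%group` are made up for illustration.

```perl
use strict;
use warnings;

# Count identical lines, then order the distinct lines by their first
# tab-separated field and, within each field, by descending count.
sub count_and_sort {
    my @lines = @_;

    my %count;
    $count{$_}++ for @lines;

    # Grouping key for each distinct line (assumption: first field).
    my %group;
    $group{$_} = (split /\t/)[0] for keys %count;

    my @sorted = sort {
           $group{$a} cmp $group{$b}      # ALPHA:D before ALPHA:E
        || $count{$b} <=> $count{$a}      # highest count first
        || $a cmp $b                      # stable order for equal counts
    } keys %count;

    # Append the count as an extra tab-separated field.
    return map { "$_\t$count{$_}" } @sorted;
}

# The sample data from the question, written compactly.
my @data = (
    ("ALPHA:D\t20 letters\tABCCEDFFGAACDDDEEEFG") x 3,
    ("ALPHA:D\t20 letters\tABCCEDFGGAACDDDEEEFG") x 4,
    ("ALPHA:D\t20 letters\tABCCEDFFGAACDDEEEEFG") x 2,
    ("ALPHA:E\t24 letters\tABCCEDFFGAACDDDEEEFGAGAD") x 5,
    ("ALPHA:E\t24 letters\tABCCEDFFGAACDDDEEEFGAGAE") x 7,
);

print "$_\n" for count_and_sort(@data);
```

On this sample the ALPHA:D lines come out with counts 4, 3, 2, followed by the ALPHA:E lines with counts 7 and 5, which is the order the question asks for.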
Re^2: String sorting in Perl by Limbic~Region (Chancellor) on Jun 04, 2014 at 17:38 UTC
Laurent_R, "And you're done." It appears the OP is interested in getting the values out in descending order (sorted). `for my $line (sort { $count_hash{$b} <=> $count_hash{$a} } keys %count_hash) { print "$line $count_hash{$line}\n"; }` Cheers - L~R
Re^3: String sorting in Perl by Laurent_R (Canon) on Jun 04, 2014 at 17:48 UTC
Yes, Limbic~Region, you're right ++, that's why I updated the post immediately after I posted it, but you saw it before I posted the change. Having said that, my understanding of how the sort should be carried out is not exactly the same as yours (I explained it in my update).
Re^4: String sorting in Perl by Limbic~Region (Chancellor) on Jun 04, 2014 at 18:06 UTC
Re^2: String sorting in Perl by markdavis87 (Novice) on Jun 04, 2014 at 18:14 UTC
That last anonymous post was me... Sorry! I figured I should give you the full code I'm using to test this out. Here it is: `#!/usr/bin/perl -w use strict; my $file = "my path to the file name"; open (FH, "< $file") or die "Can't open $file for read: $!"; my @data = <FH>; close FH or die "Cannot close $file: $!"; my %count_hash; for my $line (@data) { $count_hash{$line}++; } my @sorted_data = sort { length $a <=> length $b || $count_hash{b} <=> $count_hash{$a} } keys %count_hash; print @sorted_data; # see if it worked`
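The reading part of the script could also be written with a lexical filehandle, three-argument `open`, and `chomp`, so that a count can later be appended to the end of each line rather than after its newline. The sketch below is only an illustration: the sub name `count_lines` is invented, skipping blank lines is an assumption, and an in-memory string stands in for the real file, whose path the post does not give.

```perl
use strict;
use warnings;

# Read lines from an open filehandle, strip the line endings, and
# return a hash mapping each distinct line to its number of occurrences.
sub count_lines {
    my ($fh) = @_;
    my %count;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line eq '';    # assumption: blank lines carry no data
        $count{$line}++;
    }
    return %count;
}

# In-memory stand-in for the data file (a few lines from the question).
# For a real file, pass the file name instead of \$sample.
my $sample = join '',
    "ALPHA:D\t20 letters\tABCCEDFFGAACDDDEEEFG\n",
    "ALPHA:D\t20 letters\tABCCEDFFGAACDDDEEEFG\n",
    "ALPHA:E\t24 letters\tABCCEDFFGAACDDDEEEFGAGAD\n";

open my $fh, '<', \$sample or die "Can't open sample data: $!";
my %count = count_lines($fh);
close $fh or die "Cannot close sample data: $!";

# Each distinct line followed by a tab and its count.
print "$_\t$count{$_}\n" for sort keys %count;
```

Because the keys no longer end in a newline, the count lands on the same output line as the data it belongs to.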
Re^2: String sorting in Perl by markdavis87 (Novice) on Jun 04, 2014 at 18:23 UTC
Ah! I got it. `my @sorted_data = sort { length $a <=> length $b || $count_hash{b} <=> $count_hash{$a} } keys %count_hash;` should be `my @sorted_data = sort { length $a <=> length $b || $count_hash{$b} <=> $count_hash{$a} } keys %count_hash;` We were just missing a "$" before the "b"...
Re^3: String sorting in Perl by Laurent_R (Canon) on Jun 04, 2014 at 18:26 UTC
Yes, I was going to give the answer, but you found out yourself. Sorry for the typo. I'll update my post to get it right.
Re^4: String sorting in Perl by markdavis87 (Novice) on Jun 04, 2014 at 18:29 UTC
Re^5: String sorting in Perl by Laurent_R (Canon) on Jun 04, 2014 at 18:47 UTC
Re^4: String sorting in Perl by markdavis87 (Novice) on Jun 04, 2014 at 19:06 UTC
Re^2: String sorting in Perl by Anonymous Monk on Jun 04, 2014 at 18:02 UTC
When I try to run your code, I get the following output:

Use of uninitialized value in numeric comparison (<=>) at ./sorttest.pl line 12.
Use of uninitialized value in numeric comparison (<=>) at ./sorttest.pl line 12.
Use of uninitialized value in numeric comparison (<=>) at ./sorttest.pl line 12.
Use of uninitialized value in numeric comparison (<=>) at ./sorttest.pl line 12.
ALPHA:D    20 letters    ABCCEDFFGAACDDEEEEFG
ALPHA:D    20 letters    ABCCEDFGGAACDDDEEEFG
ALPHA:D    20 letters    ABCCEDFFGAACDDDEEEFG
ALPHA:E    24 letters    ABCCEDFFGAACDDDEEEFGAGAD
ALPHA:E    24 letters    ABCCEDFFGAACDDDEEEFGAGAE

How can I resolve this, and where are the count values for these?
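Both questions can be illustrated with a small experiment. The sketch below assumes a hash shaped like the one the counting loop would build for the ALPHA:D lines (the keys keep their newline because the lines were never chomped); the `@warnings` array and the `$SIG{__WARN__}` handler are only there to make the warnings observable and are not part of the original script.

```perl
use strict;
use warnings;

# Counts as they would look after the counting loop; the keys keep
# their trailing newline because the input lines were not chomped.
my %count_hash = (
    "ALPHA:D\t20 letters\tABCCEDFGGAACDDDEEEFG\n" => 4,
    "ALPHA:D\t20 letters\tABCCEDFFGAACDDDEEEFG\n" => 3,
    "ALPHA:D\t20 letters\tABCCEDFFGAACDDEEEEFG\n" => 2,
);

# Collect warnings in an array instead of sending them to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# Typo: {b} is the literal string key "b", which is not in the hash,
# so the lookup yields undef and <=> warns on each comparison.
my @broken = sort { $count_hash{b} <=> $count_hash{$a} } keys %count_hash;
my $typo_warnings = @warnings;

# Fix: {$b} refers to sort's $b variable, so both sides are real counts.
@warnings = ();
my @fixed = sort { $count_hash{$b} <=> $count_hash{$a} } keys %count_hash;

# The sorted list holds only the keys; the counts stay in the hash and
# have to be looked up when printing.
for my $line (@fixed) {
    chomp(my $text = $line);
    print "$text\t$count_hash{$line}\n";
}
```

The sorted array contains only the hash keys, which is why no counts appeared in the output above: `print @sorted_data` never consults `%count_hash`.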
Re: String sorting in Perl by perlfan (Parson) on Jun 05, 2014 at 13:09 UTC
Is this a bioinformatics application? Have you checked out anything available via BioPerl?