Re: Sorting issue
by GrandFather (Saint) on Nov 04, 2011 at 20:54 UTC
Nope, I just can't do it. I tried to read your mind to determine what your input data looks like and to see how the sort was different than you want, but I just can't do it. Maybe you are too far away, or asleep, or just don't broadcast very well, but I failed. Sorry.
I suggest, though, that you take a hard look at your use of $tags in your foreach loop. Assigning to the loop variable seems wrong to me. There are of course times when that is the thing to do, but your code doesn't make it clear that changing the loop variable is intended: you use the same hash lookup ($tag{$tags}) in two places, having changed $tags in between. If that is what you want, I'd create a new variable to make the intent clear.
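As a rough sketch of what I mean by a new variable (the data and the names lines_by_key, $key and $first_column are made up for illustration, not taken from your code): the loop variable stays untouched, so every lookup uses the same key, and the value you want to modify lives in its own variable.

```perl
use strict;
use warnings;

# Hypothetical data: key => original comma-separated line.
my %tag = (
    '"EEBBBBGGGBB"' => '"EEBBBBGGGBB",1700',
    '"BBBCDDERFGG"' => '"BBBCDDERFGG",850',
);

# Walk the hash in key order. The loop variable $key is never assigned
# to, so both uses of $tag_ref->{$key} look up the same entry.
sub lines_by_key {
    my ($tag_ref) = @_;
    my @found;
    for my $key ( sort keys %$tag_ref ) {
        my @columns      = split /,/, $tag_ref->{$key};
        my $first_column = $columns[0];    # separate variable for the derived value
        $first_column =~ tr/"//d;          # modifying the copy leaves $key alone
        push @found, "$first_column => $tag_ref->{$key}";
    }
    return @found;
}

print "$_\n" for lines_by_key( \%tag );
```

Had the quotes been stripped from the loop variable itself, the second lookup would have used a key that is not in the hash.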
True laziness is hard work
#Input file
Tags Frequency
EEBBBBGGGBB 1700
BBBCDDERFGG 850
CCCDEDFFFES 45
----------- --
#output file
Header Tags Frequency
>HWTI_1700_468983 EEBBBBGGGBB 1700
>HWTI_850_52 BBBCDDERFGG 850
------------
With my code, I am able to sort by Tags, but I want to sort by Frequency. I tried the suggestion by "aaron_baugher", but I am getting the warning "Use of uninitialized value..."
Ok, that helps. If the tags in your input file are guaranteed to be unique, it's easy. Put them in a hash with the frequencies as the values, and then sort on the values. In this example, %tags is the hash that stores the tags and their corresponding values, and then it's sorted on the values numerically, largest to smallest. The sub make_unique_string() creates the unique key for your output file from the tag and freq.
my %tags;
while(<$input_file_descriptor>){
    # do stuff to skip headers and blank lines
    chomp;
    my( $tag, $freq ) = split /\s+/;
    $tags{$tag} = $freq;
}
for my $tag (sort { $tags{$b} <=> $tags{$a} } keys %tags ){
    my $freq = $tags{$tag}; # to clarify things below
    my $unique_string = make_unique_string($tag, $freq);
    print ">$unique_string\t$tag\t$freq\n";
}
Re: Sorting issue
by aaron_baugher (Curate) on Nov 04, 2011 at 21:30 UTC
You're splitting the line on commas (after changing tabs to commas, which is puzzling), then saving each line in a hash with the key being the first element from your split. So the hash key that you're sorting by is that first column. If you want to sort by something else, you have to tell the sort function that.
To make another column easily available to sort, and to avoid duplicating work you've already done, save @columns in your hash instead of the original line. Then you'll have a hash of arrays, so you can sort on whichever element of the array you'd like:
$tag{$columns[0]} = \@columns;
}
foreach my $tags ( sort { $tag{$a}[1] <=> $tag{$b}[1] } keys %tag ){
In this case, I'm using <=> to sort numerically, based on the second element of the array pointed to by each hash key's value. To sort alphabetically, change <=> to cmp. Now you can get your array back into @columns with the dereference @{$tag{$tags}}, so you don't have to re-split your line.
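Here is a minimal, self-contained sketch of the hash-of-arrays idea. The sub name sort_rows_by_second_column and the tab-separated sample rows are my assumptions for illustration; adapt the split to whatever your file really uses.

```perl
use strict;
use warnings;

# Build a hash of arrays keyed on the first column, then return the
# rows (array references) ordered numerically by the second column,
# smallest first.
sub sort_rows_by_second_column {
    my @lines = @_;
    my %tag;
    for my $line (@lines) {
        my @columns = split /\t/, $line;
        $tag{ $columns[0] } = \@columns;    # store a reference to the row
    }
    return map { $tag{$_} }
           sort { $tag{$a}[1] <=> $tag{$b}[1] } keys %tag;
}

# Sample rows modelled on the input shown earlier in the thread.
my @sample = ( "EEBBBBGGGBB\t1700", "BBBCDDERFGG\t850", "CCCDEDFFFES\t45" );
for my $row ( sort_rows_by_second_column(@sample) ) {
    my @columns = @$row;                    # dereference, no re-split needed
    print join( "\t", @columns ), "\n";
}
```

Swap $a and $b in the sort block to get largest first.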
One concern: you said you're trying to come up with a unique key for each line, but you're using the first column alone as the key when you put them in the hash. If the values from the first column aren't already unique, you'll be overwriting values there, so lines will already be missing by the time you sort and start adding your other parts. If you need to add the frequency and a random number to get a unique key (and I have a feeling there's a better way to do that than with random numbers, which could repeat), you should do that before you save the key in your hash.
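One possible alternative to random numbers, sketched under my own assumptions (the name make_header_generator, the HWTI prefix handling and the six-digit padding are illustrative, not a requirement of your format): hand out a running sequence number, which cannot repeat within a run.

```perl
use strict;
use warnings;

# Return a closure that produces headers with an increasing sequence
# number, so no two calls can ever yield the same header.
sub make_header_generator {
    my ($prefix) = @_;
    my $count = 0;
    return sub {
        my ($freq) = @_;
        $count++;
        return sprintf '%s_%s_%06d', $prefix, $freq, $count;
    };
}

my $next_header = make_header_generator('HWTI');
print $next_header->(1700), "\n";    # HWTI_1700_000001
print $next_header->(850),  "\n";    # HWTI_850_000002
```

Call the generator while reading the input, before storing anything in the hash, and use its result as the hash key.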
Hi Aaron,
Thanks for the reply. I tried your suggestion, but it did not work, as I found that there were repeats in the second column (frequency) of my input file. So I changed the format of the first column of the output by concatenating the tags to the end. That way, it looks unique. Now I would like to sort on the first column of my output. I created another hash to make it work, but so far without success. The code that I changed is below. Any suggestions will be helpful.
while (my $line=<$FILE1>)
{
    chomp $line;
    $line=~s/\t/,/g;
    my @columns=split(/,/, $line);
    my $tags=$columns[0];
    #$tag{$columns[0]}=\@columns;
    $tag{$tags}=$line;
}
foreach my $tags (keys %tag){
    my $header;
    my $range=500000;
    my @columns=split(/,/,$tag{$tags});
    $tags=$columns[0];
    my $freq=$columns[1];
    my $random_number=int(rand($range));
    $header=">HWTI_".$freq."_".$random_number.$tags;
    $header=~tr/"//d;
    my $printline=$tag{$tags};
    $printline=$header.",".$printline;
    print $FILE2 "$printline\n";
}
You didn't show how you tried my suggestions, so I'm not sure why it didn't work for you. Here's a more complete example, which takes your sample input and sorts it by the frequencies (largest to smallest), outputting with a header built to your latest spec. Make sure you understand what's going on in the sort {block}: what $tags{$a} means, for instance. I'm sorting on the values, not the keys. The keys go into $a and $b, and I'm using those as keys into the hash to sort on the values.
#!/usr/bin/perl
use warnings;
use strict;
my %tags; # hash to store tags/freqs
while(<DATA>){
    chomp;
    my($tag, $freq) = split;    # split the line on whitespace
    $tags{$tag} = $freq;        # save the tag and freq in the hash
}
# sort the hash numerically on its values, descending
for my $tag ( sort { $tags{$b} <=> $tags{$a} } keys %tags ){
    my $freq = $tags{$tag};                 # put the freq for $tag in $freq
    my $header = make_header($tag, $freq);  # make the header
    print ">$header\t$tag\t$freq\n";        # print it out
}
sub make_header {
    my $tag = shift;            # get parameters
    my $freq = shift;
    my $r = int(rand(500000));  # pick a random number
    return "HWTI_${freq}_$r$tag";   # build the header
}
#input data
__DATA__
CCCDEDFFFES 45
EEBBBBGGGBB 1700
BBBCDDERFGG 850
# output (printed by the script above; not part of __DATA__)
>HWTI_1700_494932EEBBBBGGGBB EEBBBBGGGBB 1700
>HWTI_850_10814BBBCDDERFGG BBBCDDERFGG 850
>HWTI_45_187939CCCDEDFFFES CCCDEDFFFES 45
Re: Sorting issue
by jdporter (Paladin) on Nov 05, 2011 at 00:44 UTC
Clearly what you need is a "custom sort" criteria block. So instead of just sort keys %tag as your code above has it, you can do something like this:
foreach my $tags ( sort {
    (split /,/, $tag{$a})[1] <=> (split /,/, $tag{$b})[1]
} keys %tag ) {
Of course, this could be optimized, and that might be important if your input file is huge.
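One common way to do that optimization is a Schwartzian transform, which splits each line once instead of on every comparison. This is only a sketch: the sub name keys_by_frequency is mine, and it assumes %tag maps each key to its full comma-separated line, with the frequency in the second field, as in the original code.

```perl
use strict;
use warnings;

# Order the keys of %$tag_ref by the second comma-separated field of
# each stored line. Read the pipeline bottom-up.
sub keys_by_frequency {
    my ($tag_ref) = @_;
    return map  { $_->[0] }                 # 3. keep only the key
           sort { $a->[1] <=> $b->[1] }     # 2. compare the cached numbers
           map  { [ $_, ( split /,/, $tag_ref->{$_} )[1] ] }    # 1. split once per key
           keys %$tag_ref;
}

my %tag = (
    EEBBBBGGGBB => 'EEBBBBGGGBB,1700',
    BBBCDDERFGG => 'BBBCDDERFGG,850',
    CCCDEDFFFES => 'CCCDEDFFFES,45',
);
print "$_\n" for keys_by_frequency( \%tag );
```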
I reckon we are the only monastery ever to have a dungeon stuffed with 16,000 zombies.
Hi jdporter,
I have tried the custom sort previously, but it gave the error message "Use of uninitialized value..." for each line. So I checked the input file for repeats in the frequency column, and there are repeats. I guess there will be a clash in the way the hash stores each value. In the first column of the output file, I also concatenated the tag column (first column of the input file) to make it unique. Now I am thinking of sorting on the first column of the output file. So far, not successful. Currently, my output file looks like this:
Header Tags Frequency
>HWTI_2_78439EEEEEMMMMMG EEEEEMMMMMG 2
>HWTI_3_338554FFEFFFDFEMM FFEFFFDFEMM 3
-------------------------------------------
Duplicate values in a hash aren't a problem; only duplicate keys are. So if your tags are never repeated, you'll be fine putting each input line's tag as the hash key and the frequency as its value.
When it comes time to sort, you can only sort your hash on something that's in your hash. So if your hash contains the tags and frequencies, you can sort on either of those (see my last reply for how to sort on the values); but you can't sort on the header that you haven't created yet.
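To show that repeated frequencies are harmless, here is a small sketch with made-up data in which the value 2 appears twice. The sub name tags_by_frequency_desc and the alphabetical tie-break are my additions; the tie-break only makes the order of equal frequencies predictable.

```perl
use strict;
use warnings;

# Sort tags by frequency, largest first. Tags that share a frequency
# are ordered alphabetically so the result is deterministic.
sub tags_by_frequency_desc {
    my ($tags_ref) = @_;
    return sort {
        $tags_ref->{$b} <=> $tags_ref->{$a}    # numeric, descending
            or $a cmp $b                       # tie-break on the tag itself
    } keys %$tags_ref;
}

# Hypothetical data: unique tags (keys), one repeated frequency (value).
my %tags = ( EEEEEMMMMMG => 2, FFEFFFDFEMM => 3, AAAAABBBBBC => 2 );
print "$_\t$tags{$_}\n" for tags_by_frequency_desc( \%tags );
```

All three tags survive in the hash because the keys differ; only a repeated tag would overwrite an earlier entry.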
Re: Sorting issue
by JavaFan (Canon) on Nov 04, 2011 at 21:02 UTC
I'm not sure whether I understand what you want, but could it be as simple as:
open my $FILE1, "sort -nk2 $input_file |" or die;
?
(2-arg open, because I can never remember whether I need "|-" or "-|" for 3-arg open; using 2-arg open beats looking it up in the manual).
Yeah, and that's so very confusing. See, for me, '-' just screams STANDARD (IN|OUT)PUT. So '|-' just looks like I get to read from the program's standard output, and '-|' looks like I get to write to the program's standard input. Which is just the other way around from what it really is. :-/
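For the record, a sketch of the 3-arg form for reading from a command: the mode is '-|', and the list form keeps the shell out of it. The sub name read_sorted_by_second_column is made up, and this assumes a Unix-like system with a sort command on the PATH.

```perl
use strict;
use warnings;

# Return the lines of $file as ordered by the external sort command.
# '-|' means: run the command and read from its standard output.
sub read_sorted_by_second_column {
    my ($file) = @_;
    open my $fh, '-|', 'sort', '-n', '-k2', $file
        or die "Cannot run sort on $file: $!";
    my @lines = <$fh>;
    close $fh or die "sort failed on $file";
    chomp @lines;
    return @lines;
}
```

One way to remember it: the pipe character sits on the side where the command is, so in '-|' the command is on the left and its output flows to you, while '|-' has you writing into the command.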
Re: Sorting issue
by planetscape (Chancellor) on Nov 05, 2011 at 15:15 UTC