bluray has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perlmonks,

I am having trouble sorting the output of my code by its third column. This code is a follow-up to the output generated by another script (Title: Reverse complement; see bluray's writeups). My aim is to create a unique identifier for each 11-character tag in the first column of the input file. The second column is the frequency of each tag. I used this code to create a new first column, in which each line starts with >HWTI_frequency_randomnumber. Although I was able to get results with this code, I have trouble sorting by the second column of my input file.

#!/usr/bin/perl
use strict;
use warnings;

my $input_file="file1.seq";
open my $FILE1, "<", $input_file or die "Cannot open $input_file .$!";
my $output_file="file2.csv";
open my $FILE2, ">", $output_file or die "Cannot open $output_file .$!";

my $line=<$FILE1>;
chomp $line;
my @columnheadings=split(/\t/, $line);
unshift(@columnheadings, ("Header"));
my $heading=join("\t", @columnheadings);
print $FILE2 "$heading\n";

my %tag;
while (my $line=<$FILE1>) {
    chomp $line;
    $line=~s/\t/,/g;
    my @columns=split(/,/, $line);
    my $tags=$columns[0];
    $tag{$tags}=$line;
}

foreach my $tags (sort keys %tag){
    my $header;
    my @columns=split(/,/,$tag{$tags});
    $tags=$columns[0];
    my $freq=$columns[1];
    my $range=500000;
    my $random_number=int(rand($range));
    $header=">HWTI_".$freq."_".$random_number;
    my $printline=$tag{$tags};
    $printline=$header.",".$printline;
    print $FILE2 "$printline\n";
}

In addition to the sorting issue, I am also thinking about doing a BLAST for each of these tags (#nucleotide length of 11). I will appreciate any suggestions in this matter.

Re: Sorting issue
by GrandFather (Saint) on Nov 04, 2011 at 20:54 UTC

    Nope, I just can't do it. I tried to read your mind to determine what your input data looks like and to see how the sort was different than you want, but I just can't do it. Maybe you are too far away, or asleep, or just don't broadcast very well, but I failed. Sorry.

    I suggest though that you take a hard look at your use of $tags in your foreach loop. Assigning to the loop variable seems wrong to me. There are of course times when that is the thing to do, but your code doesn't make it clear that changing the content of the loop variable is the intended behaviour as you use the same hash lookup ($tag{$tags}) in two places having changed $tags between times. If that is what you want I'd create a new variable to make it clear that that is the intent.
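
    One way to make the intent explicit is to leave the loop variable alone and read everything else from the stored line. This is only a sketch: it assumes the hash maps each tag to its comma-joined line, as in the original post, and build_lines() is a made-up helper name.

```perl
use strict;
use warnings;

# Assumed shape, as in the original post: tag => "tag,frequency"
my %tag = (
    EEBBBBGGGBB => 'EEBBBBGGGBB,1700',
    BBBCDDERFGG => 'BBBCDDERFGG,850',
);

# build_lines() is a hypothetical helper name, not part of the original code.
sub build_lines {
    my ($tag_ref) = @_;
    my @out;
    foreach my $key (sort keys %$tag_ref) {
        # $key is never assigned to; only the frequency is read from the line
        my (undef, $freq) = split /,/, $tag_ref->{$key};
        push @out, ">HWTI_${freq}," . $tag_ref->{$key};
    }
    return @out;
}

print "$_\n" for build_lines(\%tag);
```

    Because $key is only ever read, both hash lookups are guaranteed to hit the same entry.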

    True laziness is hard work
      Hi Grandfather,

      I forgot to post my input and desired output format. Sorry!

      #Input file
      Tags           Frequency
      EEBBBBGGGBB    1700
      BBBCDDERFGG    850
      CCCDEDFFFES    45
      -----------    --

      #output file
      Header               Tags           Frequency
      >HWTI_1700_468983    EEBBBBGGGBB    1700
      >HWTI_850_52         BBBCDDERFGG    850
      ------------

      With my code, I am able to sort by Tags, but I want to sort by Frequency. Although I tried the suggestion by "aaron_baugher", I am getting the warning "Use of uninitialized value..."

        Ok, that helps. If the tags in your input file are guaranteed to be unique, it's easy. Put them in a hash with the frequencies as the values, and then sort on the values. In this example, %tags is the hash that stores the tags and their corresponding values, and then it's sorted on the values numerically, largest to smallest. The sub make_unique_string() creates the unique key for your output file from the tag and freq.

        my %tags;
        while(<$input_file_descriptor>){
            # do stuff to skip headers and blank lines
            chomp;
            my( $tag, $freq ) = split /\s+/;
            $tags{$tag} = $freq;
        }
        for my $tag (sort { $tags{$b} <=> $tags{$a} } keys %tags ){
            my $freq = $tags{$tag}; # to clarify things below
            my $unique_string = make_unique_string($tag, $freq);
            print ">$unique_string\t$tag\t$freq\n";
        }
Re: Sorting issue
by aaron_baugher (Curate) on Nov 04, 2011 at 21:30 UTC

    You're splitting the line on commas (after changing tabs to commas, which is puzzling), then saving each line in a hash with the key being the first element from your split. So the hash key that you're sorting by is that first column. If you want to sort by something else, you have to tell the sort function that.

    To make another column easily available to sort, and to avoid duplicating work you've already done, save @columns in your hash instead of the original line. Then you'll have a hash of arrays, so you can sort on whichever element of the array you'd like:

        $tag{$columns[0]} = \@columns;
    }
    foreach my $tags ( sort { $tag{$a}[1] <=> $tag{$b}[1] } keys %tag ){

    In this case, I'm using <=> to sort numerically, based on the second element of the array pointed to by each hash key's value. To sort alphabetically, change <=> to cmp. Now you can get your array back into @columns with the dereference @{$tag{$tags}}, so you don't have to re-split your line.
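
    Put together, the hash-of-arrays approach might look like the following sketch. The helper name sort_by_freq() and the tab-separated input lines are assumptions made for illustration, not part of the original script.

```perl
use strict;
use warnings;

# Sort "tag<TAB>frequency" lines by frequency, smallest first.
# sort_by_freq() is a hypothetical helper name.
sub sort_by_freq {
    my @lines = @_;
    my %tag;
    for my $line (@lines) {
        my @columns = split /\t/, $line;
        $tag{ $columns[0] } = \@columns;    # store the array, not the line
    }
    my @sorted;
    for my $key ( sort { $tag{$a}[1] <=> $tag{$b}[1] } keys %tag ) {
        my @columns = @{ $tag{$key} };      # dereference; no re-split needed
        push @sorted, join "\t", @columns;
    }
    return @sorted;
}

print "$_\n"
    for sort_by_freq( "EEBBBBGGGBB\t1700", "BBBCDDERFGG\t850", "CCCDEDFFFES\t45" );
```

    Swapping $a and $b in the sort block reverses the order, giving largest first.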

    One concern: you said you're trying to come up with a unique key for each line, but you're using the first column alone as the key when you put them in the hash. If the values from the first column aren't already unique, you'll be overwriting values there, so lines will already be missing by the time you sort and start adding your other parts. If you need to add the frequency and a random number to get a unique key (and I have a feeling there's a better way to do that than with random numbers, which could repeat), you should do that before you save the key in your hash.
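
    One alternative to random numbers is a running serial number, which cannot repeat within a run. This is a suggestion of mine rather than anything the original code does, and make_headers() is a made-up name:

```perl
use strict;
use warnings;

# make_headers() is a hypothetical helper: it numbers the records instead of
# drawing random numbers, so no two headers can collide within one run.
sub make_headers {
    my @records = @_;          # each record is [ $tag, $freq ]
    my $serial  = 0;
    return map { sprintf '>HWTI_%d_%06d', $_->[1], ++$serial } @records;
}

print "$_\n" for make_headers( [ EEBBBBGGGBB => 1700 ], [ BBBCDDERFGG => 850 ] );
```

    The headers stay unique even when two tags share the same frequency, because the serial part differs.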

      Hi Aaron,

      Thanks for the reply. I tried your suggestion, but it did not work, because I found that there were repeats in the second column (frequency) of my input file. So I changed the format of the first column of the output by concatenating the tag to the end, which makes it look unique. Now I would like to sort on the first column of my output. I created another hash to make it work, but so far without success. The code that I changed is below. Any suggestions will be helpful.

      while (my $line=<$FILE1>) {
          chomp $line;
          $line=~s/\t/,/g;
          my @columns=split(/,/, $line);
          my $tags=$columns[0];
          #$tag{$columns[0]}=\@columns;
          $tag{$tags}=$line;
      }
      foreach my $tags (keys %tag){
          my $header;
          my $range=500000;
          my @columns=split(/,/,$tag{$tags});
          $tags=$columns[0];
          my $freq=$columns[1];
          my $random_number=int(rand($range));
          $header=">HWTI_".$freq."_".$random_number.$tags;
          $header=~tr/"//d;
          my $printline=$tag{$tags};
          $printline=$header.",".$printline;
          print $FILE2 "$printline\n";
      }

        You didn't show how you tried my suggestions, so I'm not sure why it didn't work for you. Here's a more complete example, which takes your sample input and sorts it by the frequencies (largest to smallest), outputting with a header built to your latest spec. Make sure you understand what's going on in the sort {block}: what $tags{$a} means, for instance. I'm sorting on the values, not the keys. The keys go into $a and $b, and I'm using those as keys into the hash to sort on the values.

        #!/usr/bin/perl
        use warnings;
        use strict;

        my %tags; # hash to store tags/freqs
        while(<DATA>){
            chomp;
            my($tag, $freq) = split;   # split the line on whitespace
            $tags{$tag} = $freq;       # save the tag and freq in the hash
        }

        # sort the hash numerically on its values, descending
        for my $tag ( sort { $tags{$b} <=> $tags{$a} } keys %tags ){
            my $freq = $tags{$tag};                 # put the freq for $tag in $freq
            my $header = make_header($tag, $freq);  # make the header
            print ">$header\t$tag\t$freq\n";        # print it out
        }

        sub make_header {
            my $tag = shift;            # get parameters
            my $freq = shift;
            my $r = int(rand(500000));  # pick a random number
            return "HWTI_${freq}_$r$tag"; # build the header
        }

        #input data
        __DATA__
        CCCDEDFFFES 45
        EEBBBBGGGBB 1700
        BBBCDDERFGG 850

        #output
        >HWTI_1700_494932EEBBBBGGGBB    EEBBBBGGGBB    1700
        >HWTI_850_10814BBBCDDERFGG      BBBCDDERFGG    850
        >HWTI_45_187939CCCDEDFFFES      CCCDEDFFFES    45
Re: Sorting issue
by jdporter (Paladin) on Nov 05, 2011 at 00:44 UTC

    Clearly what you need is a "custom sort" criteria block. So instead of just sort keys %tag as your code above has it, you can do something like this:

    foreach my $tags ( sort { (split /,/,$a)[1] <=> (split /,/,$b)[1] } keys %tag ) {

    Of course, this could be optimized, and that might be important if your input file is huge.
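
    One common way to do that optimization is a Schwartzian transform, which splits each record once instead of on every comparison. The sketch below is an assumption-laden illustration: sorted_keys_by_field() is a made-up name, and it splits the stored line ($hash{$key}) rather than the key itself, on the assumption that the keys are bare tags and the values are "tag,frequency" strings as in the original post.

```perl
use strict;
use warnings;

# sorted_keys_by_field() is a hypothetical helper name.
# Returns the hash keys ordered numerically by one comma-separated field
# of the corresponding value.
sub sorted_keys_by_field {
    my ($hash_ref, $index) = @_;
    return
        map  { $_->[0] }                    # 3. keep only the key
        sort { $a->[1] <=> $b->[1] }        # 2. compare the cached numbers
        map  { [ $_, ( split /,/, $hash_ref->{$_} )[$index] ] }  # 1. split once per key
        keys %$hash_ref;
}

my %tag = (
    EEBBBBGGGBB => 'EEBBBBGGGBB,1700',
    BBBCDDERFGG => 'BBBCDDERFGG,850',
    CCCDEDFFFES => 'CCCDEDFFFES,45',
);
print join( ' ', sorted_keys_by_field( \%tag, 1 ) ), "\n";
```

    For a few thousand lines the difference is unlikely to matter; it pays off mainly when the per-record work is expensive or the file is very large.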

    I reckon we are the only monastery ever to have a dungeon stuffed with 16,000 zombies.
      Hi jdporter,

      I have tried the custom sort previously, but it gave the warning "Use of uninitialized value..." for each line. So I checked the frequency column of the input file for repeats, and it does contain repeated values. I guess there will be a clash in the way the hash stores each value. In the first column of the output file, I also concatenated the tag column (the first column of the input file) to make it unique. Now I am thinking of sorting by the first column of the output file, so far without success. Currently, my output file looks like this:

      Header                       Tags           Frequency
      >HWTI_2_78439EEEEEMMMMMG     EEEEEMMMMMG    2
      >HWTI_3_338554FFEFFFDFEMM    FFEFFFDFEMM    3
      -------------------------------------------

        Duplicate values in a hash aren't a problem; only duplicate keys are. So if your tags are never repeated, you'll be fine putting each input line's tag as the hash key and the frequency as its value.

        When it comes time to sort, you can only sort your hash on something that's in your hash. So if your hash contains the tags and frequencies, you can sort on either of those (see my last reply for how to sort on the values); but you can't sort on the header that you haven't created yet.
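
        To illustrate that repeated frequencies are harmless, here is a small sketch that sorts on the values and breaks ties on the tag so the order is fully determined. The helper name by_freq_then_tag() is made up, and the third tag's frequency is invented to create a tie.

```perl
use strict;
use warnings;

# Two tags share the frequency 3; that is fine because the keys differ.
# (The CCCDEDFFFES frequency is invented here to force a tie.)
my %tags = ( EEEEEMMMMMG => 2, FFEFFFDFEMM => 3, CCCDEDFFFES => 3 );

# by_freq_then_tag() is a hypothetical helper name.
sub by_freq_then_tag {
    my ($tags_ref) = @_;
    return sort {
        $tags_ref->{$b} <=> $tags_ref->{$a}    # frequency, largest first
            or $a cmp $b                       # then tag, alphabetically
    } keys %$tags_ref;
}

print "$_\t$tags{$_}\n" for by_freq_then_tag( \%tags );
```

        The <=> comparison returns 0 on a tie, so the "or" falls through to the string comparison on the tags.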

Re: Sorting issue
by JavaFan (Canon) on Nov 04, 2011 at 21:02 UTC
    I'm not sure whether I understand what you want, but could it be as simple as:
    open my $FILE1, "sort -nk2 $input_file |" or die;
    ?

    (2-arg open, because I can never remember whether I need "|-" or "-|" for 3-arg open; using 2-arg open beats looking it up in the manual).

      (2-arg open, because I can never remember whether I need "|-" or "-|" for 3-arg open; using 2-arg open beats looking it up in the manual).

      Just imagine the - being replaced by the program name. So |- is like |consumer (and you can write to the file handle in your Perl code), and -| is like producer| (and you can read from the file handle in your Perl code).
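
      For reference, a 3-arg (list form) version of the sort pipe might look like this sketch. It assumes a POSIX sort is on the PATH and that the file has no header line; read_sorted() is a made-up helper name.

```perl
use strict;
use warnings;

# read_sorted() is a hypothetical helper name. It assumes a POSIX `sort`
# is on the PATH and that the file has no header line.
sub read_sorted {
    my ($path) = @_;
    # '-|' : the program is the producer and we read its output.
    # The list form bypasses the shell, so $path is never parsed by it.
    open my $fh, '-|', 'sort', '-n', '-k2,2', $path
        or die "Cannot run sort on $path: $!";
    chomp( my @lines = <$fh> );
    close $fh or die "sort failed on $path";
    return @lines;
}

if (@ARGV) {
    print "$_\n" for read_sorted( $ARGV[0] );
}
```

      Writing to a program instead would use '|-' in the same position, with print going to the file handle.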

        Yeah, and that's so very confusing. See, for me, '-' just screams STANDARD (IN|OUT)PUT. So '|-' looks like I get to read from the program's standard output, and '-|' looks like I get to write to the program's standard input, which is exactly the opposite of what it really is. :-/
Re: Sorting issue
by planetscape (Chancellor) on Nov 05, 2011 at 15:15 UTC