in reply to Re: Creating a column of frequency for the unique entries of another column
in thread Creating a column of frequency for the unique entries of another column

Hi Marshall,

Thanks for the reply. I was able to fix the heading problem in the output. I had forgotten to put the "print $headings;" outside the second foreach loop (the while loop in yours) in my script. When I added it in the while loop (in my program), that was solved. But I still have one additional problem: getting the frequency of each row of my output as a second column. Let me explain it more clearly. My input file is below:

@HWDFFFDDABCDEFFFFDDDDDRFFFFEFFEEDDABCDEDDDDDD @HWDFFFDDABCDEFFDFEDEDDRFFFFEFFEEDDABCDEDDDDDD @HWDFFFDDABCDEFFFFDDDDDRFFFFEFFEEDDABCDEDDDDDD @HWDFFFDDABCDEFFFFDDDDDRFFFFEFFEEDDABCDEDDDDDD @HWDFFFDDABCDEFFDFEDEDDRFFFFEFFEEDDABCDEDDDDDD .............................................

What I did so far was to match the two 'ABCD's in each row and report the 22 characters between them in the output file. The 22 characters are split in half (11 characters each) and placed in the same column of my output, and I also added the headings ('Tags' and 'Frequency'). If you look at the input, the first, third, and fourth rows are the same, and the second and fifth rows are the same. My desired output would be:

Tags         Frequency
EFFFFDDDDDR  3
FFFFEFFEEDD  3
EFFDFEDEDDR  2
FFFFEFFEEDD  2
............

My current output repeats identical sequences, one for each matching input row, and doesn't include the frequency. I would like to collapse these repeated sequences into a single entry each and show its frequency next to it. I hope that is clear. Thanks again.

Re^3: Creating a column of frequency for the unique entries of another column
by aaron_baugher (Curate) on Oct 29, 2011 at 17:24 UTC

    When you want to count the number of times that each of an assortment of things appears, that usually means you want to use a hash, with the 'things' as the keys and the count kept as the values. In this case, use the sequences as your hash keys and increment each one's value each time it appears:

    # my %freq; #<---- uncomment this somewhere before your loop
    #                  through the file to instantiate the hash

        if ($line=~m/$sequence(.{11})(.{11})$sequence/){
            $freq{$1}++;
            $freq{$2}++;
        }
    }   # <---- closes your existing loop over the input lines

    # after all lines are read in....
    # sort through the sequences, highest frequency to lowest
    for my $seq (sort { $freq{$b} <=> $freq{$a} } keys %freq){
        print "$seq $freq{$seq}\n";
    }
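For anyone who wants to try this without the rest of the original script, here is a minimal self-contained sketch of the same hash-counting idea. The sub names count_tags and count_pairs, the @lines array, and the tab-separated output are my own assumptions for illustration, not part of the original poster's program; 'ABCD' as the flanking marker and the five sample rows come from the question above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: 'ABCD' is the flanking marker, as in the sample input.
my $sequence = 'ABCD';

# Count every 11-character half on its own (the approach described above).
# Hypothetical helper: takes an array ref of lines, returns a hash ref.
sub count_tags {
    my ($lines, $marker) = @_;
    my %freq;
    for my $line (@$lines) {
        # \Q...\E quotes the marker so it is matched literally
        if ($line =~ m/\Q$marker\E(.{11})(.{11})\Q$marker\E/) {
            $freq{$1}++;
            $freq{$2}++;
        }
    }
    return \%freq;
}

# Variant: count each row's pair of halves together, keyed "first second",
# so halves shared by different rows are not merged into one count.
sub count_pairs {
    my ($lines, $marker) = @_;
    my %freq;
    for my $line (@$lines) {
        if ($line =~ m/\Q$marker\E(.{11})(.{11})\Q$marker\E/) {
            $freq{"$1 $2"}++;
        }
    }
    return \%freq;
}

# Sample rows copied from the question; single quotes keep '@' literal.
my @lines = (
    '@HWDFFFDDABCDEFFFFDDDDDRFFFFEFFEEDDABCDEDDDDDD',
    '@HWDFFFDDABCDEFFDFEDEDDRFFFFEFFEEDDABCDEDDDDDD',
    '@HWDFFFDDABCDEFFFFDDDDDRFFFFEFFEEDDABCDEDDDDDD',
    '@HWDFFFDDABCDEFFFFDDDDDRFFFFEFFEEDDABCDEDDDDDD',
    '@HWDFFFDDABCDEFFDFEDEDDRFFFFEFFEEDDABCDEDDDDDD',
);

my $tags = count_tags(\@lines, $sequence);

print "Tags\tFrequency\n";
# Highest frequency first; ties broken alphabetically for stable output.
for my $seq (sort { $tags->{$b} <=> $tags->{$a} || $a cmp $b } keys %$tags) {
    print "$seq\t$tags->{$seq}\n";
}
```

One thing to watch with the sample data: both kinds of row end in the same second half, FFFFEFFEEDD, so counting halves separately gives that tag a single total of 5 rather than the separate 3 and 2 shown in the desired output. If the per-row counts are what is wanted, the count_pairs variant keeps the two halves of each row together as one key, and each key can be split on the space when printing.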