in reply to Creating a column of frequency for the unique entries of another column

Your code is very confusing to me as I haven't seen the previous post. Is something like this what you meant?
#!/usr/bin/perl -w
use strict;

my @input_files = <*.seq>;
my $sequence    = 'ABCD';
my @headings    = ('Tags', 'Frequency');
my $headings    = join("\t", @headings);    # "Tags\tFrequency"

foreach my $input_file (@input_files)
{
    open(INPUT, '<', $input_file) or die "Cannot open file: $!\n";
    my $outfile = $input_file;
    $outfile =~ s/\.seq$/.tag.txt/i;    # /g makes no sense
    open(OUTPUT, '>', $outfile) or die "Cannot open file $!\n";
    print OUTPUT "\n$headings\n";
    while (my $line = <INPUT>)
    {
        if ($line =~ m/$sequence(.{11})(.{11})$sequence/o)
        {
            print OUTPUT "$1\n$2\n";
        }
    }
}
I'm sure that there is still stuff wrong with this code. Can you show one input file (a few lines) and the expected output of your program?

Re^2: Creating a column of frequency for the unique entries of another column
by bluray (Sexton) on Oct 29, 2011 at 16:50 UTC
    Hi Marshall,

    Thanks for the reply. I was able to fix the heading problem in the output. I forgot to add the "print $headings;" outside the second foreach loop (while loop in yours) in my script. When I added it in the while loop (in my program), that was solved. But, I still have one additional problem, i.e. to get the frequency of the rows in my output column as a second column. Let me explain it more clearly. My input file is below:

    @HWDFFFDDABCDEFFFFDDDDDRFFFFEFFEEDDABCDEDDDDDD @HWDFFFDDABCDEFFDFEDEDDRFFFFEFFEEDDABCDEDDDDDD @HWDFFFDDABCDEFFFFDDDDDRFFFFEFFEEDDABCDEDDDDDD @HWDFFFDDABCDEFFFFDDDDDRFFFFEFFEEDDABCDEDDDDDD @HWDFFFDDABCDEFFDFEDEDDRFFFFEFFEEDDABCDEDDDDDD .............................................

    What I did so far was to match the two 'ABCD's in each row and write the 22 characters between them to the output file. The 22 characters are split in half (11 characters each) and both halves are placed in the same column of my output, under the headings 'Tags' and 'Frequency'. If you look at the input, the first, third, and fourth rows are the same, and the second and fifth rows are the same. My desired output is:

    Tags Frequency
    EFFFFDDDDDR 3
    FFFFEFFEEDD 3
    EFFDFEDEDDR 2
    FFFFEFFEEDD 2
    ............

    My current output repeats the same sequences, one row per match, and doesn't have the frequency. I would like to eliminate these repeated sequences and replace them with a single row per sequence, together with its frequency. Hope it is clear. Thanks again.

      When you want to count the number of times that each of an assortment of things appears, that usually means you want to use a hash, with the 'things' as the keys and the count kept as the values. In this case, use the sequences as your hash keys and increment each one's value each time it appears:

      # my %freq;   #<---- uncomment this somewhere before your loop
      #             # through the file to instantiate the hash

          if ($line=~m/$sequence(.{11})(.{11})$sequence/){
              $freq{$1}++;
              $freq{$2}++;
          }
      }   # end of the loop that reads the file
      # after all lines are read in....
      # sort through the sequences, highest frequency to lowest
      for my $seq (sort { $freq{$b} <=> $freq{$a} } keys %freq){
          print "$seq $freq{$seq}\n";
      }
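
Putting the two pieces together, here is a minimal, self-contained sketch of the hash-counting approach run against the five sample lines posted above. The helper name count_tags and the tab-separated output are assumptions made for illustration; they are not from the original script. Ties in frequency are broken alphabetically so the order is stable.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (name is illustrative, not from the thread):
# count every 11-character tag found between two copies of $marker.
sub count_tags {
    my ($marker, @lines) = @_;
    my %freq;
    for my $line (@lines) {
        # \Q...\E makes sure the marker is matched literally
        if ($line =~ m/\Q$marker\E(.{11})(.{11})\Q$marker\E/) {
            $freq{$1}++;
            $freq{$2}++;
        }
    }
    return \%freq;
}

# Sample lines copied from the input shown earlier in the thread
my @lines = (
    '@HWDFFFDDABCDEFFFFDDDDDRFFFFEFFEEDDABCDEDDDDDD',
    '@HWDFFFDDABCDEFFDFEDEDDRFFFFEFFEEDDABCDEDDDDDD',
    '@HWDFFFDDABCDEFFFFDDDDDRFFFFEFFEEDDABCDEDDDDDD',
    '@HWDFFFDDABCDEFFFFDDDDDRFFFFEFFEEDDABCDEDDDDDD',
    '@HWDFFFDDABCDEFFDFEDEDDRFFFFEFFEEDDABCDEDDDDDD',
);

my $freq = count_tags('ABCD', @lines);

# Highest frequency first; alphabetical order breaks ties
print "Tags\tFrequency\n";
for my $tag (sort { $freq->{$b} <=> $freq->{$a} or $a cmp $b } keys %$freq) {
    print "$tag\t$freq->{$tag}\n";
}
```

Note that counting each 11-character tag on its own gives FFFFEFFEEDD a single row with a count of 5, because it is the second half of all five lines. If the counts should instead follow the rows (3 and 2, as in the desired output above), one option is to key the hash on the whole 22-character pair and split it only when printing.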