Creating a column of frequency for the unique entries of another column

bluray has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perlmonks,

This post is a follow-up to the one I posted earlier. My original aim was to match a four-letter word ('ABCD') in the rows of the input file and then report the 10 characters following the match in the output file. I was able to do that. Then I found that 'ABCD' occurs twice in each row, with 22 characters between the two occurrences. These 22 characters should be split into two halves (11 characters each), and both halves should be reported in the same column. I was able to do that as well. My problem now is getting a heading for the file. The code below prints the heading, but it is repeated after every 2 lines (22 characters). I also need to reformat the file so that the frequency of each unique row appears in the next column (which in effect reduces the number of rows).


#!/usr/bin/perl -w
use strict;
use warnings;

my @input_files=<*.seq>;
my $local_count=0;

my %hash;
foreach my $input_file(@input_files)
{
unless (open(INPUT, $input_file))
{
    print "Cannot open file \"$input_file\"\n\n";
    exit;
}

my $sequence='ABCD';
my @headings=('Tags', 'Frequency');
my $headings=join("\t",@headings);
while (my $line=<INPUT>)

{
    
   if ($local_count==0){
    my $outfile=$input_file;
    $outfile=~s/.seq/.tag.txt/gi;
    unless (open (OUTPUT, ">$outfile"))
    {
        print "Cannot open file \"$outfile\"\n\n";
        exit;
    }
    }
    chomp $line;
 
   
   foreach($line=~m/$sequence/i){
             if ($line=~m/$sequence(.{11})(.{11})$sequence/){
         print OUTPUT  "\n",$headings,"\n",$1,"\n",$2;
                                
       }
         $local_count++;
         
    }
    
    }


}

The output I am getting now is in this format below:


Tags        Frequency
CDDDDDDDDDD    
BCDDEDDDDDR    
Tags       Frequency    
CDEDEDDDESE
CEEESEEDESE    
Tags       Frequency


Replies are listed 'Best First'.
Re: Creating a column of frequency for the unique entries of another column by toolic (Bishop) on Oct 28, 2011 at 18:43 UTC
I think you want to print headings only outside your foreach loop:

print OUTPUT "\n", $headings;
foreach ( $line =~ m/$sequence/i ) {
    if ( $line =~ m/$sequence(.{11})(.{11})$sequence/ ) {
        print OUTPUT "\n", $1, "\n", $2;
    }
    $local_count++;
}

Aside: perltidy is nice.
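To illustrate the point about printing the heading outside the loop, here is a minimal sketch. The helper name `write_tags` and the use of in-memory filehandles are assumptions made for the example and do not come from the thread; the two sample lines are taken from the input posted further down.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not from the thread: copies both 11-character halves
# from $in to $out and writes the heading exactly once, before any data rows.
sub write_tags {
    my ($in, $out, $sequence) = @_;
    print {$out} join("\t", 'Tags', 'Frequency'), "\n";    # heading: once per file
    while (my $line = <$in>) {
        chomp $line;
        # \Q...\E makes the marker match literally
        if ($line =~ m/\Q$sequence\E(.{11})(.{11})\Q$sequence\E/) {
            print {$out} "$1\n$2\n";
        }
    }
    return;
}

# Demo on in-memory filehandles so the sketch runs without any .seq files.
my $data = join "\n",
    '@HWDFFFDDABCDEFFFFDDDDDRFFFFEFFEEDDABCDEDDDDDD',
    '@HWDFFFDDABCDEFFDFEDEDDRFFFFEFFEEDDABCDEDDDDDD',
    '';
my $result = '';
open my $in,  '<', \$data   or die "Cannot read sample data: $!";
open my $out, '>', \$result or die "Cannot open in-memory output: $!";
write_tags($in, $out, 'ABCD');
close $out or die "Cannot close in-memory output: $!";
print $result;
```

Because the heading is printed before the `while` loop starts, it appears once no matter how many lines match. In a real script the two in-memory handles would be replaced by the input and output files.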
Re: Creating a column of frequency for the unique entries of another column by Cristoforo (Curate) on Oct 29, 2011 at 20:28 UTC
Tags        Frequency
EFFFFDDDDDR     3
FFFFEFFEEDD     3
EFFDFEDEDDR     2
FFFFEFFEEDD     2
............

I got different results from your dataset using the code below. I am assuming you want a new output file for every `.seq` input file. You would just need to uncomment the 4 commented statements and change the `foreach my $input_file ('o66.txt')` line to `foreach my $input_file (@input_files)`.

#!/usr/bin/perl
use strict;
use warnings;

my $sequence='ABCD';
my @headings= qw/ Tags Frequency /;
my @input_files=<*.seq>;

foreach my $input_file ('o66.txt') {
    open INPUT, "<", $input_file or die "Cannot open file \"$input_file\". $!";
    (my $outfile = $input_file) =~ s/.seq/.tag.txt/i;

    my %freq;
    while (my $line=<INPUT>) {
        if ($line=~m/$sequence(.{11})(.{11})$sequence/i){
            $freq{$_}++ for $1, $2;
        }
    }
    close INPUT or die "Cannot close file \"$input_file\". $!";

    #open OUTPUT, ">", $outfile or die "Cannot open file \"$outfile\". $!";
    #printf OUTPUT "%-12s%s\n", @headings;
    printf "%-12s%s\n", @headings;

    for my $tag (sort {$freq{$b} <=> $freq{$a}} keys %freq) {
        #printf OUTPUT "%-12s%5s\n", $tag, $freq{ $tag };
        printf "%-12s%5s\n", $tag, $freq{ $tag};
    }
    #close OUTPUT or die "Unable to close \"$outfile\". $!";
}
__END__

o66.txt is below:

@HWDFFFDDABCDEFFFFDDDDDRFFFFEFFEEDDABCDEDDDDDD
@HWDFFFDDABCDEFFDFEDEDDRFFFFEFFEEDDABCDEDDDDDD
@HWDFFFDDABCDEFFFFDDDDDRFFFFEFFEEDDABCDEDDDDDD
@HWDFFFDDABCDEFFFFDDDDDRFFFFEFFEEDDABCDEDDDDDD
@HWDFFFDDABCDEFFDFEDEDDRFFFFEFFEEDDABCDEDDDDDD

output is:

C:\Old_Data\perlp>perl t.pl
Tags        Frequency
FFFFEFFEEDD     5
EFFFFDDDDDR     3
EFFDFEDEDDR     2
Re^2: Creating a column of frequency for the unique entries of another column by bluray (Sexton) on Oct 29, 2011 at 21:10 UTC
Thanks Cristoforo, I used `*.seq` because I have several input files. Anyway, if you delete the output file from the same directory, you would get the same result when you run the script again.
Re: Creating a column of frequency for the unique entries of another column by Marshall (Canon) on Oct 29, 2011 at 11:18 UTC
Your code is very confusing to me as I haven't seen the previous post. Is something like this what you meant?

#!/usr/bin/perl -w
use strict;

my @input_files=<*.seq>;
my $sequence='ABCD';
my @headings=('Tags', 'Frequency');
my $headings=join("\t",@headings);   #"Tags\tFrequency"

foreach my $input_file(@input_files)
{
    open(INPUT, '<', $input_file) or die "Cannot open file: $!\n";
    my $outfile=$input_file;
    $outfile =~ s/.seq/.tag.txt/i;   #/g makes no sense
    open (OUTPUT, '>', $outfile) or die "Cannot open file $!\n";
    print OUTPUT "\n$headings\n";
    while (my $line=<INPUT>)
    {
        if ($line=~m/$sequence(.{11})(.{11})$sequence/o)
        {
            print OUTPUT "$1\n$2\n";
        }
    }
}

I'm sure that there is still stuff wrong with this code. Can you show one input file (a few lines) and the expected output of your program?
Re^2: Creating a column of frequency for the unique entries of another column by bluray (Sexton) on Oct 29, 2011 at 16:50 UTC
Hi Marshall,

Thanks for the reply. I was able to fix the heading problem in the output. I forgot to add the "print $headings;" outside the second foreach loop (while loop in yours) in my script. When I added it in the while loop (in my program), that was solved. But I still have one additional problem: getting the frequency of the rows in my output column as a second column. Let me explain it more clearly. My input file is below:

@HWDFFFDDABCDEFFFFDDDDDRFFFFEFFEEDDABCDEDDDDDD
@HWDFFFDDABCDEFFDFEDEDDRFFFFEFFEEDDABCDEDDDDDD
@HWDFFFDDABCDEFFFFDDDDDRFFFFEFFEEDDABCDEDDDDDD
@HWDFFFDDABCDEFFFFDDDDDRFFFFEFFEEDDABCDEDDDDDD
@HWDFFFDDABCDEFFDFEDEDDRFFFFEFFEEDDABCDEDDDDDD
.............................................

What I did so far was to match the two 'ABCD's in each row and report the 22 characters between them in the output file. The 22 characters are split in half (11 characters each) and placed in the same column of my output, and I also added the headings ('Tags' and 'Frequency'). If you look at the input, the first, third, and fourth rows are the same, and the second and fifth rows are also the same. My desired output will be:

Tags        Frequency
EFFFFDDDDDR     3
FFFFEFFEEDD     3
EFFDFEDEDDR     2
FFFFEFFEEDD     2
............

My current output includes repeated sequences and doesn't have the frequency. I would like to eliminate these repeated sequences and replace each one with a single sequence and its frequency. Hope it is clear. Thanks again.
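One possible reading of the desired output above, where FFFFEFFEEDD is listed twice (with counts 3 and 2), is that the frequency is counted per distinct row, that is, per pair of halves, rather than per individual tag. The following is only a sketch under that assumption; the helper name `count_pairs` is hypothetical and the sample lines are the five rows quoted in the post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: counts each distinct 22-character stretch found
# between two occurrences of the marker, one count per input row.
sub count_pairs {
    my ($sequence, @lines) = @_;
    my %freq;
    for my $line (@lines) {
        if ($line =~ m/\Q$sequence\E(.{22})\Q$sequence\E/) {
            $freq{$1}++;
        }
    }
    return \%freq;
}

my @lines = (
    '@HWDFFFDDABCDEFFFFDDDDDRFFFFEFFEEDDABCDEDDDDDD',
    '@HWDFFFDDABCDEFFDFEDEDDRFFFFEFFEEDDABCDEDDDDDD',
    '@HWDFFFDDABCDEFFFFDDDDDRFFFFEFFEEDDABCDEDDDDDD',
    '@HWDFFFDDABCDEFFFFDDDDDRFFFFEFFEEDDABCDEDDDDDD',
    '@HWDFFFDDABCDEFFDFEDEDDRFFFFEFFEEDDABCDEDDDDDD',
);
my $freq = count_pairs('ABCD', @lines);

printf "%-12s%s\n", 'Tags', 'Frequency';
# Most frequent rows first; ties are broken alphabetically for a stable order.
for my $pair (sort { $freq->{$b} <=> $freq->{$a} || $a cmp $b } keys %$freq) {
    # Split the 22 characters back into the two 11-character tags.
    for my $tag (substr($pair, 0, 11), substr($pair, 11)) {
        printf "%-12s%5d\n", $tag, $freq->{$pair};
    }
}
```

Under this reading each row contributes one count, and both of its halves are printed with the row's count, which is why a tag shared by different rows can appear more than once.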
Re^3: Creating a column of frequency for the unique entries of another column by aaron_baugher (Curate) on Oct 29, 2011 at 17:24 UTC
When you want to count the number of times that each of an assortment of things appears, that usually means you want to use a hash, with the 'things' as the keys and the count kept as the values. In this case, use the sequences as your hash keys and increment each one's value each time it appears:

# my %freq; #<---- uncomment this somewhere before your loop
#                  through the file to instantiate the hash

    if ($line=~m/$sequence(.{11})(.{11})$sequence/){
        $freq{$1}++;
        $freq{$2}++;
    }
}
# after all lines are read in....
# sort through the sequences, highest frequency to lowest
for my $seq (sort { $freq{$b} <=> $freq{$a} } keys %freq){
    print "$seq $freq{$seq}\n";
}
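The hash-counting idea can be put together as a small self-contained script. This is a sketch rather than a definitive version: the helper name `count_tags` is hypothetical, the sample data is the five-line input posted earlier in the thread, and the alphabetical tie-break in the sort is an addition made only so the order is stable.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns a hash ref mapping each 11-character tag
# (the sequence is the key) to the number of times it was seen (the value).
sub count_tags {
    my ($sequence, @lines) = @_;
    my %freq;
    for my $line (@lines) {
        if ($line =~ m/\Q$sequence\E(.{11})(.{11})\Q$sequence\E/) {
            $freq{$_}++ for $1, $2;
        }
    }
    return \%freq;
}

my @sample = (
    '@HWDFFFDDABCDEFFFFDDDDDRFFFFEFFEEDDABCDEDDDDDD',
    '@HWDFFFDDABCDEFFDFEDEDDRFFFFEFFEEDDABCDEDDDDDD',
    '@HWDFFFDDABCDEFFFFDDDDDRFFFFEFFEEDDABCDEDDDDDD',
    '@HWDFFFDDABCDEFFFFDDDDDRFFFFEFFEEDDABCDEDDDDDD',
    '@HWDFFFDDABCDEFFDFEDEDDRFFFFEFFEEDDABCDEDDDDDD',
);
my $tags = count_tags('ABCD', @sample);

printf "%-12s%s\n", 'Tags', 'Frequency';
# Highest frequency first; equal counts fall back to alphabetical order.
for my $tag (sort { $tags->{$b} <=> $tags->{$a} || $a cmp $b } keys %$tags) {
    printf "%-12s%5d\n", $tag, $tags->{$tag};
}
```

On the sample lines this counts FFFFEFFEEDD 5 times, EFFFFDDDDDR 3 times and EFFDFEDEDDR 2 times, the same figures Cristoforo reports for o66.txt.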