Match Count

de2425 has asked for the wisdom of the Perl Monks concerning the following question:

I am desperately trying to figure out how to accomplish a match count between two files. I'm sure that this is something that is very simple but as a novice I am failing miserably at it.

I have two files of data and need to count matches of specific values between them. The first file can be considered a master list. I want to take the numeric IDs from that file and count how many times each one occurs in the second file. I would then like to print the alphanumeric data from the master list that is associated with each ID, along with a count of how many times that ID occurs in the second file.

The data from the first file looks approximately like:

Name, description, ID#

The second file has the same type of data, but the lists are shorter and there are several of them. It looks like:

Name, description, ID#, Name, description, ID#, Name, description, ID#

The Output I'm looking for is:

Name, ID#, #of times matched

I have tried several different things without any success. My sample code is below. If anyone could offer any constructive help, I would very much appreciate it.


#!/usr/bin/perl -w

open (IN, "c:/work/Cytokine_By_Company/ING_cytokines_20080805.txt");

     while (<IN>){
            chomp;
           @t=split(/\t/,$_);
           $ING{$t[9]}=$t[1];
           #print %ING;
     }
close IN;

open (OUT, ">c:/work/GeneID_Count/Cytokine.txt")||die "I'm not dead yet";
open (IN, "c:/work/Cytokine_By_Company/CytokineArrays.txt")||die "I'm Dead!!!!";

     while(<IN>){
     chomp;
     @cytokine=split(/\t/,$_);
          
     while (/\d+/ and exists $ING{$cytokine}){
            $count++;}
             
     foreach $cytokine (sort{$ING{$b}<=>$ING{$a}} keys $ING){
            print OUT "$ING{cytokine}\t$count\n";}
           
      }       
close IN;
close OUT;

Thanks to everyone for their help. Between all of your comments and some thinking, I finally got the code to generate what I needed. I also understand your comments about declaring my variables; however, the person I'm working under gets very frustrated with me when I do this (please don't ask me why), so I leave the declarations out. Anyway, my new code looks like this:

#!/usr/bin/perl -w

#open(OUT, ">c:/work/new_list.txt");
open (IN, "c:/work/GeneID_Count/CytokineList.txt")||die "Could not open CytokineList.txt";

%seen = ();
while(<IN>){ 
     chomp;
     @cytokine=split(/\t/,$_);
     $seen{$cytokine[0]}++;
    
}
close IN;

open (OUT, ">c:/work/GeneID_Count/Cytokine.txt")||die "Could not create Cytokine.txt";
open (IN, "c:/work/Cytokine_By_Company/ING_cytokines_20080805.txt")|| die "Could not open ING_cytokines_20080805.txt";

while (<IN>){
      chomp;
      @ING=split(/\t/,$_);
      if ($ING[9]=~/\d/ and exists $seen{$ING[9]}){
           print OUT "$ING[1]\t$ING[9]\t$seen{$ING[9]}\n";
      }
}

close IN;
close OUT;

Thank you all again for all of your help.


Replies are listed 'Best First'.
Re: Match Count by shmem (Chancellor) on Sep 08, 2008 at 15:37 UTC
You have improved slightly compared to Count Matches in a File, but I guess you could do better. Maybe re-reading that thread again (it is not long) helps a bit along the way.

     @cytokine=split(/\t/,$_);
     while (/\d+/ and exists $ING{$cytokine}){
# huh? ---------------------------^^^^^^^^^
            $count++;}

Where does `$cytokine` get set? And why are you incrementing `$count`, never using it later? `@cytokine` and `$cytokine` are two unrelated variables; the former is an array, while the latter is a scalar. See perldata.
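
To make the array-versus-scalar point concrete, here is a minimal sketch. It assumes the ID is the third tab-separated field (index 2), following the "Name, description, ID#" layout in the question; the sub name `id_is_known` and the sample records are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical master-list hash: ID => name (sample values for illustration).
my %ING = ( 3569 => 'IL6', 7124 => 'TNF' );

# Return 1 if the ID in the third tab-separated field of $line is a key
# of the hash referenced by $master, otherwise 0.
sub id_is_known {
    my ($master, $line) = @_;
    my @cytokine = split /\t/, $line;   # @cytokine is an array of fields
    my $id = $cytokine[2];              # $cytokine[2] is one element of it
    return (defined $id && exists $master->{$id}) ? 1 : 0;
}

print id_is_known(\%ING, "IL6\tinterleukin 6\t3569"), "\n";   # known ID
print id_is_known(\%ING, "FOO\tnot a cytokine\t42"), "\n";    # unknown ID
```

The lookup key is always a single element such as `$cytokine[2]`, never the bare scalar `$cytokine`, which Perl treats as a separate variable.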
Re: Match Count by moritz (Cardinal) on Sep 08, 2008 at 14:35 UTC
Please start all your scripts with `use strict; use warnings;` and declare your variables. It catches at least one error that you make in your script. (As I already told you in the thread Count Matches in a File - why don't you heed the advice? It's good advice after all.) This also looks wrong:

$ING{$t[9]}=$t[1];

Array indexes start with 0, so you'll probably want `$ING{$t[8]} = $t[0];` instead.
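
A small sketch of both points, using an invented ten-field record: index 0 is the first field and index 9 the tenth, and under `use strict` an undeclared `$cytokine` is rejected at compile time even when `@cytokine` exists. The string `eval` is only there so the error can be captured and printed instead of stopping the script.

```perl
use strict;
use warnings;

# A made-up tab-separated record with ten fields, for illustration only.
my @t = split /\t/, join("\t", map { "field$_" } 1 .. 10);

print "first field: $t[0]\n";   # index 0 holds the first field
print "tenth field: $t[9]\n";   # index 9 holds the tenth field

# Under strict, using the undeclared scalar $cytokine is a compile-time
# error even though the array @cytokine is declared.
my $ok = eval q{
    my @cytokine = ('a', 'b');
    my $first = $cytokine;      # typo: should be $cytokine[0]
    1;
};
my $error = $ok ? '' : $@;
print "strict says: $error";
```

Which index is right for the real files depends on whether the ID is the ninth or the tenth column, so it is worth printing one split line and checking.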
Re: Match Count by apl (Monsignor) on Sep 08, 2008 at 15:10 UTC
You might want to put a `|| die` after your open of ING_cytokines_20080805.txt. You should also display the name of the file and the returned error code in each die.
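
One way to do that, as a sketch: wrap the open in a small helper that reports both the path and `$!`. The helper name `open_for_reading` and the path below are hypothetical; the three-argument `open` with a lexical filehandle is a stylistic choice, not something the original script used.

```perl
use strict;
use warnings;

# Open a file for reading, or die with the file name and the OS error ($!).
sub open_for_reading {
    my ($path) = @_;
    open my $fh, '<', $path
        or die "Could not open '$path' for reading: $!\n";
    return $fh;
}

# Hypothetical path that should not exist, to show what the message looks like.
my $fh = eval { open_for_reading('no/such/dir/input.txt') };
print "Error was: $@" unless $fh;
```

With the file name in the message, a failure on one of several opens points straight at the file that caused it.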
Re: Match Count by dwm042 (Priest) on Sep 08, 2008 at 16:15 UTC
de2425, this is a fast and dirty rewrite of your code, incorporating some of moritz's and shmem's comments. Further, I'll note you don't get keys from $ING, but rather from the hash %ING. The code below has not been run but it has been checked for syntax with `perl -cw`. I'll note that most of what I've done is clean up the variable syntax (and I'm sure someone else could do a better job of it). I've also taken the time to eliminate a lot of those hard coded files and directory settings. The name of the output file can now be passed as a parameter as well.

#!/usr/bin/perl
use warnings;
use strict;
#
# Rewrite for legibility.
# This code is not tested.
#
my $output_file = shift || "Cytokine.txt";
my $work_dir = "c:/work/Cytokine_By_Company";
my $report_dir = "c:/work/GeneID_Count";
my $data_set_one = "ING_cytokines_200805.txt";
my $data_set_two = "CytokineArrays.txt";
my %ING;

open (IN, "< $work_dir/$data_set_one") or die "Could not open input file $data_set_one. $!\n";
while (<IN>) {
    chomp;
    my @source_data = split(/\t/,$_);
    $ING{$source_data[8]}{name} = $source_data[0];
    $ING{$source_data[8]}{count} = 0;
}
close IN;

open (OUT, "> $report_dir/$output_file") or die "Could not open file $output_file. $!\n";
open (IN, "< $work_dir/$data_set_two") or die "Could not open file $data_set_two. $!\n";
while(<IN>) {
    chomp;
    my @target_data = split(/\t/,$_);
    my $cytokine_id = $target_data[2];
    while (/\d+/ and exists $ING{$cytokine_id}{name}) {
        $ING{$cytokine_id}{count}++;
    }
    for my $id ( sort{ $ING{$b} <=> $ING{$a} } keys %ING) {
        print OUT "$ING{$id}{name}\t$ING{$id}{count}\n";
    }
}
close IN;
close OUT;
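
For comparison, here is a self-contained sketch of the same two-pass idea that can be run as-is. It is not the code from this reply: the inner `while (/\d+/ and exists ...)` above has a condition that never changes, so it would not terminate on a matching line, and `$ING{$b} <=> $ING{$a}` compares hash references rather than counts. The sketch assumes the layout described in the question (master file: name, description, ID per line; second file: repeated name, description, ID triples per line, tab-separated), reads from in-memory strings instead of the real files, and uses invented sub names and sample records.

```perl
use strict;
use warnings;

# Read the master list: one "Name<TAB>description<TAB>ID" record per line.
sub load_master {
    my ($fh) = @_;
    my %master;
    while (my $line = <$fh>) {
        chomp $line;
        my ($name, undef, $id) = split /\t/, $line;
        next unless defined $id && $id =~ /^\d+$/;   # skip lines without a numeric ID
        $master{$id} = { name => $name, count => 0 };
    }
    return \%master;
}

# Count IDs in the second file, where each line holds repeated triples.
sub count_matches {
    my ($fh, $master) = @_;
    while (my $line = <$fh>) {
        chomp $line;
        my @fields = split /\t/, $line;
        # IDs sit at indices 2, 5, 8, ... (third field of every triple).
        for (my $i = 2; $i < @fields; $i += 3) {
            my $id = $fields[$i];
            $master->{$id}{count}++ if exists $master->{$id};
        }
    }
    return $master;
}

# Build "Name<TAB>ID<TAB>count" lines, highest count first, then by ID.
sub report {
    my ($master) = @_;
    my @ids = sort {
        $master->{$b}{count} <=> $master->{$a}{count} || $a <=> $b
    } keys %$master;
    return map { join "\t", $master->{$_}{name}, $_, $master->{$_}{count} } @ids;
}

# In-memory stand-ins for the two files; all records are sample data.
my $master_text = "IL6\tinterleukin 6\t3569\nTNF\ttumor necrosis factor\t7124\n";
my $arrays_text = "IL6\tinterleukin 6\t3569\tTNF\ttumor necrosis factor\t7124\n"
                . "IL6\tinterleukin 6\t3569\tFOO\tnot in master\t42\n";

open my $master_fh, '<', \$master_text or die "Cannot read master data: $!\n";
open my $arrays_fh, '<', \$arrays_text or die "Cannot read array data: $!\n";

my $counts = count_matches($arrays_fh, load_master($master_fh));
print "$_\n" for report($counts);
```

The report is printed once, after the second file has been read completely, so each master entry appears on exactly one line in the "Name, ID#, # of times matched" order asked for in the question. To use real files, the two in-memory opens would be replaced with opens on the actual paths.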