de2425 has asked for the wisdom of the Perl Monks concerning the following question:

I am desperately trying to figure out how to accomplish a match count between two files. I'm sure that this is something that is very simple but as a novice I am failing miserably at it.

I have two files of data, and I need to count matches of specific values between them. The first file can be considered a master list of sorts. I want to take the numeric list from that file and count how many times each number occurs in the second file. I would then like to print the alphanumeric data from the master list that is associated with each number, along with a count of how many times that item occurs in the second file.

The data from the first file looks approximately like:

Name, description, ID#

The second file holds the same type of data, but the lists are shorter and there are several of them. It looks like:

Name, description, ID#, Name, description, ID#, Name, description, ID#

The output I'm looking for is:

Name, ID#, # of times matched

I have tried several different things but have not had any success at all. My sample code is below. If anyone could offer any constructive help, I would very much appreciate it.

#!/usr/bin/perl -w
open (IN, "c:/work/Cytokine_By_Company/ING_cytokines_20080805.txt");
while (<IN>){
    chomp;
    @t=split(/\t/,$_);
    $ING{$t[9]}=$t[1];
    #print %ING;
}
close IN;
open (OUT, ">c:/work/GeneID_Count/Cytokine.txt")||die "I'm not dead yet";
open (IN, "c:/work/Cytokine_By_Company/CytokineArrays.txt")||die "I'm Dead!!!!";
while(<IN>){
    chomp;
    @cytokine=split(/\t/,$_);
    while (/\d+/ and exists $ING{$cytokine}){
        $count++;}
    foreach $cytokine (sort{$ING{$b}<=>$ING{$a}} keys $ING){
        print OUT "$ING{cytokine}\t$count\n";}
}
close IN;
close OUT;

Thanks to everyone for their help. Between all of your comments and some thinking, I finally got the code to generate what I needed. I also understand your comments about declaring my variables; however, the person I'm working under gets very frustrated with me when I do this. Please don't ask me why. Therefore, I leave the declarations out. Anyway, my new code looks like this:

#!/usr/bin/perl -w
#open(OUT, ">c:/work/new_list.txt");
open (IN, "c:/work/GeneID_Count/CytokineList.txt")||die "Could not open Cytokine Arrays.txt";
%seen = ();
while(<IN>){
    chomp;
    @cytokine=split(/\t/,$_);
    $seen{$cytokine[0]}++;
}
close IN;
open (OUT, ">c:/work/GeneID_Count/Cytokine.txt")||die "Could not create Cytokine.txt";
open (IN, "c:/work/Cytokine_By_Company/ING_cytokines_20080805.txt")||die "Could not open ING_cytokines_20080805.txt";
while (<IN>){
    chomp;
    @ING=split(/\t/,$_);
    if ($ING[9]=~/\d/ and exists $seen{$ING[9]}){
        print OUT "$ING[1]\t$ING[9]\t$seen{$ING[9]}\n";
    }
}
close IN;
close OUT;

Thank you all again for all of your help.

Replies are listed 'Best First'.
Re: Match Count
by shmem (Chancellor) on Sep 08, 2008 at 15:37 UTC

    You have improved slightly compared to Count Matches in a File, but I guess you could do better. Re-reading that thread (it is not long) may help a bit along the way.

    @cytokine=split(/\t/,$_);
    while (/\d+/ and exists $ING{$cytokine}){
    # huh? ----------------------^^^^^^^^^
        $count++;}

    Where does $cytokine get set? And why are you incrementing $count, never using it later? @cytokine and $cytokine are two unrelated variables; the former is an array, while the latter is a scalar. See perldata.
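
    To illustrate the distinction, here is a minimal sketch. The sample line and its values are made up for the example; only the variable names @cytokine and $cytokine come from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical tab-separated record: name, description, ID.
my $line = "IL6\tinterleukin 6\t3569";

# @cytokine is an array holding every field of the line.
my @cytokine = split /\t/, $line;

# $cytokine[2] is one element of that array ($ sigil plus an index).
my $id = $cytokine[2];

# A scalar named $cytokine is a separate variable. It shares nothing with
# @cytokine and stays undefined until something is assigned to it.
my $cytokine;

printf "fields: %d\n", scalar @cytokine;
printf "id: %s\n", $id;
printf "scalar \$cytokine is %s\n", defined $cytokine ? "defined" : "undefined";
```

    So a lookup such as $ING{$cytokine[2]} uses a field of the split line, while $ING{$cytokine} uses the unrelated scalar.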

Re: Match Count
by moritz (Cardinal) on Sep 08, 2008 at 14:35 UTC
    Please start all your scripts with
    use strict; use warnings;

    And declare your variables. It catches at least one error that you make in your script.

    (As I already told you in the thread Count Matches in a File - why don't you heed the advice? It's good advice after all).
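
    As a sketch of what strict does here, the fragment below imitates the shape of the loop in the question and compiles it inside a string eval, so the compile-time error can be printed instead of stopping the script. The sample data and the $fragment, $ok and $error names are assumptions made for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A fragment shaped like the loop in the question, compiled under strict.
# The string eval lets this script report the compile error and keep running.
my $fragment = <<'END_CODE';
use strict;
my %ING = (3569 => 'IL6');
my @cytokine = split /\t/, "IL6\tinterleukin 6\t3569";
my $count = 0;
$count++ if exists $ING{$cytokine};   # $cytokine was never declared
1;
END_CODE

my $ok    = eval $fragment;
my $error = $@;

my $message = $ok ? "compiled cleanly\n" : "strict refused it: $error";
print $message;
```

    Without strict, the same line compiles silently and looks up an undefined key, which is presumably the kind of error meant above.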

    This also looks wrong:

    $ING{$t[9]}=$t[1];

    Array indexes start with 0, so you'll probably want $ING{$t[8]} = $t[0]; instead.
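
    Which pair of indexes is right depends on where the name and ID columns actually sit in the file, which the thread does not show. A small sketch of zero-based indexing, using a made-up ten-column record:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical ten-column record; the column contents are invented.
my $line = join "\t", map { "col$_" } 1 .. 10;
my @t = split /\t/, $line;

# Index 0 is the first column, index 9 the tenth.
print "first column:  $t[0]\n";
print "second column: $t[1]\n";
print "ninth column:  $t[8]\n";
print "tenth column:  $t[9]\n";
print "last index:    $#t\n";
```

    If the ID is in the tenth column and the name in the second, $t[9] and $t[1] are the correct choices after all.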

Re: Match Count
by apl (Monsignor) on Sep 08, 2008 at 15:10 UTC
    You might want to put a || die after your open of ING_cytokines_20080805.txt.

    You should also display the name of the file and the returned error code in each die.
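
    One way this might look, as a sketch only: open_for_reading is a hypothetical helper, and the path is invented so that the open fails. $! holds the system's error text for the failed call.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: open a file for reading, or die with the file name
# and the operating system's error text from $!.
sub open_for_reading {
    my ($path) = @_;
    open(my $fh, '<', $path)
        or die "Could not open '$path' for reading: $!\n";
    return $fh;
}

# Demonstrate the message with a path that is assumed not to exist.
my $missing = "no_such_directory/no_such_file.txt";
my $fh      = eval { open_for_reading($missing) };
my $error   = $@;

my $message = $fh ? "opened $missing\n" : "error: $error";
print $message;
```

    The eval is only there so the example can print the message; in a real script the die would normally be left to stop the program.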

Re: Match Count
by dwm042 (Priest) on Sep 08, 2008 at 16:15 UTC
    de2425, this is a quick and dirty rewrite of your code, incorporating some of moritz's and shmem's comments. Further, I'll note that you don't get keys from $ING, but rather from the hash %ING. The code below has not been run, but it has been checked for syntax with perl -cw. Most of what I've done is clean up the variable syntax (and I'm sure someone else could do a better job of it).
    I've also taken the time to move those hard-coded file and directory names into variables. The name of the output file can now be passed as a parameter as well.

    #!/usr/bin/perl
    use warnings;
    use strict;
    #
    # Rewrite for legibility.
    # This code is not tested.
    #
    my $output_file = shift || "Cytokine.txt";
    my $work_dir = "c:/work/Cytokine_By_Company";
    my $report_dir = "c:/work/GeneID_Count";
    my $data_set_one = "ING_cytokines_20080805.txt";
    my $data_set_two = "CytokineArrays.txt";
    my %ING;

    # Build the master list: ID => { name, count }.
    open (IN, "< $work_dir/$data_set_one") or
        die "Could not open input file $data_set_one. $!\n";
    while (<IN>) {
        chomp;
        my @source_data = split(/\t/,$_);
        next unless defined $source_data[8];
        $ING{$source_data[8]}{name} = $source_data[0];
        $ING{$source_data[8]}{count} = 0;
    }
    close IN;

    open (OUT, "> $report_dir/$output_file") or
        die "Could not open file $output_file. $!\n";
    open (IN, "< $work_dir/$data_set_two") or
        die "Could not open file $data_set_two. $!\n";
    while(<IN>) {
        chomp;
        my @target_data = split(/\t/,$_);
        my $cytokine_id = $target_data[2];
        # Count the line once when its ID is numeric and in the master list.
        if (defined $cytokine_id and $cytokine_id =~ /\d/
                and exists $ING{$cytokine_id}) {
            $ING{$cytokine_id}{count}++;
        }
    }
    close IN;

    # Report once, after all input has been read, highest count first.
    for my $id ( sort { $ING{$b}{count} <=> $ING{$a}{count} } keys %ING) {
        print OUT "$ING{$id}{name}\t$ING{$id}{count}\n";
    }
    close OUT;
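
    The question describes each line of the second file as repeated Name, description, ID# groups, whereas the script above reads only the third field of each line. If every ID on a line should be counted, one possible approach is to step through the fields three at a time. The following is a sketch only: the sample IDs, the %name_for hash and the count_ids helper are invented for illustration, and it assumes tab-separated fields with the ID always third in each group.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical master list: ID => name.
my %name_for = (3569 => 'IL6', 3553 => 'IL1B', 7124 => 'TNF');

# Hypothetical second-file lines: repeated name, description, ID groups.
my @lines = (
    "IL6\tinterleukin 6\t3569\tTNF\ttumor necrosis factor\t7124",
    "IL6\tinterleukin 6\t3569\tIL10\tinterleukin 10\t3586",
);

# Return a hash ref of ID => count for the IDs present in the master list.
sub count_ids {
    my ($master, @records) = @_;
    my %count = map { $_ => 0 } keys %$master;
    for my $record (@records) {
        my @fields = split /\t/, $record;
        # The ID is the third field of each group: indexes 2, 5, 8, ...
        for (my $i = 2; $i < @fields; $i += 3) {
            my $id = $fields[$i];
            $count{$id}++ if exists $count{$id};
        }
    }
    return \%count;
}

my $count = count_ids(\%name_for, @lines);

# Name, ID, number of matches; highest count first, ties broken by ID.
for my $id (sort { $count->{$b} <=> $count->{$a} || $a <=> $b } keys %$count) {
    print "$name_for{$id}\t$id\t$count->{$id}\n";
}
```

    IDs that are not in the master list are ignored, and master entries that never appear are still reported with a count of 0.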