de2425 has asked for the wisdom of the Perl Monks concerning the following question:

I'm having some mental trouble today and can't think of what to do. I have a file of data with entries such as:

name, description, ID

I have another file with multiple groupings like the one shown above. What I want to do is count the number of times each ID from the first file occurs in the second file and then group the results by ID. I would like the output to look something like

name, description, ID, # of occurrences

Below is what I've tried so far, which very obviously doesn't work. I am just simply not able to think today. If anyone can help, I'd very much appreciate it.

#! /usr/local/bin/perl -w
open (CYT, "C:/Work/ING_Occurrences_Companies/CytokineArrays.txt");
while (<CYT>){
    chomp;
    @cytokine=split(/\t/,$_);
}
close CYT;
open (OUT, ">C:/Work/ING_Occurrences_Companies/ING_Count.txt");
open (IN, "C:/Work/ING_Occurrences_Companies/ING.txt");
while (<IN>){
    chomp;
    @ING=split(/\t/,$_);
    $count = 0;
    if($ING[2] =~ /@cytokine[1]/){
        $count++;
        print OUT "$ING[0]\t$ING[1]\t$cytokine{$ING[2]}\t$count\t\n";
    }
}
close IN;
close OUT;

Re: Count Matches in a File
by shmem (Chancellor) on Sep 03, 2008 at 18:58 UTC

    Many errors; let's treat them piecemeal.

    open (CYT, "C:/Work/ING_Occurrences_Companies/CytokineArrays.txt");

    What happens if the file can't be opened (file vanished, typo in the file name)? Perl doesn't tell you unless you ask it to. If the file can't be opened, it is sensible to stop, since none of the later processing makes sense. Use die for that:

    open (CYT, "C:/Work/ING_Occurrences_Companies/CytokineArrays.txt")
        or die "can't read 'C:/Work/ING_Occurrences_Companies/CytokineArrays.txt': $!\n";

    Now there could be a mismatch between the filename in the open statement and the die; it is better to use a variable. It is good practice to declare a variable as pertinent to the current file (or scope), so my is used here. Declaring the variables enables usage of strict, which will complain about undeclared variables, e.g. typos. Use it always.

    use strict;
    my $cytok_file = "C:/Work/ING_Occurrences_Companies/CytokineArrays.txt";
    open CYT, '<', $cytok_file or die "can't read '$cytok_file': $!\n";

    3-argument open lets you see open mode at a glance. See open.
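    As a minimal sketch of the same idea, the open-or-die pattern can be wrapped in a small helper. The sub name open_for_reading and the demo file name are hypothetical, and the lexical filehandle (my $fh) is a substitution for the bareword CYT used above, not something the original script does.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): open a file for
# reading with the 3-argument form and die with the OS error on failure.
sub open_for_reading {
    my ($path) = @_;
    open my $fh, '<', $path
        or die "can't read '$path': $!\n";
    return $fh;
}

# Demonstrate the failure path with a file name assumed not to exist.
my $fh = eval { open_for_reading('no_such_dir_for_demo/none.txt') };
print "open failed: $@" unless $fh;
```

    Because the path is passed in once, the file name in the error message can never drift from the one actually opened.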

    while (<CYT>){ chomp; @cytokine=split(/\t/,$_);}

    Here you are assigning the result from split to an array, overwriting it at each pass through the loop and losing the previous information. You want to count occurrences of IDs - use a hash for that. See perldata. Again, use my to declare your lexically scoped variables.

    my %occurrences;
    while (<CYT>) {
        chomp;
        my ($name, $description, $ID) = split /\t/;
        $occurrences{$ID} = 0; # initial count
    }

    split defaults to operate on $_, so that can be omitted. But since $name and $description are never used, it is not necessary to gather them in the first place. You are interested in the third element of the list which split returns, so grab that (index starts with 0):

    while (<CYT>) {
        chomp;
        my $ID = (split /\t/)[2];
        $occurrences{$ID} = 0; # initial count
    }

    or even

    while (<CYT>) {
        chomp;
        $occurrences{ (split /\t/)[2] } = 0; # initial count
    }

    although the latter might be too terse, since it doesn't give a clue anymore about what that third element is.
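    To try the hash-building step without the real data files, here is a small sketch that reads from an in-memory filehandle. The helper name read_ids and the two sample records are made up for illustration, and skipping lines with fewer than three fields is an added assumption about how malformed input should be treated.

```perl
use strict;
use warnings;

# Hypothetical helper: return a hash ref mapping each ID (the third
# tab-separated field of every line) to an initial count of 0.
sub read_ids {
    my ($fh) = @_;
    my %occurrences;
    while (my $line = <$fh>) {
        chomp $line;
        my $ID = (split /\t/, $line)[2];
        next unless defined $ID;    # assumption: skip lines with < 3 fields
        $occurrences{$ID} = 0;      # initial count
    }
    return \%occurrences;
}

# Invented sample data standing in for CytokineArrays.txt.
my $sample = "IL-2\tinterleukin 2\tID001\nTNF\ttumor necrosis factor\tID002\n";
open my $fh, '<', \$sample or die "can't open in-memory file: $!\n";

my $occ = read_ids($fh);
print join(", ", sort keys %$occ), "\n";    # prints "ID001, ID002"
```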

    close CYT;

    Again, it is sensible to check the return value of a system call:

    close CYT or die "can't close filehandle CYT properly: $!\n";

    In the next 2 lines, you are opening one file for writing and one for reading:

    open (OUT, ">C:/Work/ING_Occurrences_Companies/ING_Count.txt");
    open (IN, "C:/Work/ING_Occurrences_Companies/ING.txt");

    Perl can't tell your intention from the filehandle name. Again, use variables for your file names.

    my $outfile = "C:/Work/ING_Occurrences_Companies/ING_Count.txt";
    my $infile  = "C:/Work/ING_Occurrences_Companies/ING.txt";
    open OUT, '>', $outfile or die "can't write '$outfile': $!\n";
    open IN,  '<', $infile  or die "can't read '$infile': $!\n";

    Vertical alignment of common elements on consecutive lines makes your code more readable (as does proper indenting). From the next block

    while (<IN>){
        chomp;
        @ING=split(/\t/,$_);
        $count = 0;
        if($ING[2] =~ /@cytokine[1]/){
            $count++;
            print OUT "$ING[0]\t$ING[1]\t$cytokine{$ING[2]}\t$count\t\n";
        }
    }

    I deduce that the format of the input file is identical to the first file read, and that the ID is in the 3rd field. Just grab the ID as a key to the hash %occurrences and increment the value stored there. Store the line in another hash, also keyed on the ID -

    my %lines;
    while (<IN>) {
        chomp;
        my $ID = (split /\t/)[2];
        $occurrences{$ID}++;
        $lines{$ID} = $_;
    }

    - then sort the keys of %occurrences, iterate over them and output your data:

    foreach my $ID (sort keys %occurrences) {
        print $lines{$ID}, "\t", $occurrences{$ID}, "\n";
    }

    Depending on the type of your IDs you might want to sort them numerically. See sort.
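    Putting the pieces together, the following is one possible sketch rather than a definitive version. It is a variation on the steps above: it keeps the line from the first file (so the output carries that file's name and description, and IDs with zero matches still print) and it ignores IDs that appear only in the second file. Both choices are assumptions about the desired output. The sub name count_occurrences and the numeric sample IDs are invented for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each ID from the first file
# appears in the third field of the second file.
sub count_occurrences {
    my ($first_fh, $second_fh) = @_;
    my (%occurrences, %lines);

    while (my $line = <$first_fh>) {
        chomp $line;
        my $ID = (split /\t/, $line)[2];
        next unless defined $ID;
        $occurrences{$ID} = 0;        # initial count
        $lines{$ID}       = $line;    # name, description, ID
    }

    while (my $line = <$second_fh>) {
        chomp $line;
        my $ID = (split /\t/, $line)[2];
        # Assumption: only IDs listed in the first file are counted.
        $occurrences{$ID}++ if defined $ID && exists $occurrences{$ID};
    }

    # Numeric sort; use plain "sort keys" for non-numeric IDs.
    return map  { "$lines{$_}\t$occurrences{$_}" }
           sort { $a <=> $b } keys %occurrences;
}

# Invented sample data standing in for the two input files.
my $first  = "IL-2\tinterleukin 2\t10\nTNF\ttumor necrosis factor\t9\n";
my $second = "a\tb\t10\nc\td\t9\ne\tf\t10\ng\th\t77\n";
open my $first_fh,  '<', \$first  or die "can't open in-memory file: $!\n";
open my $second_fh, '<', \$second or die "can't open in-memory file: $!\n";

print "$_\n" for count_occurrences($first_fh, $second_fh);
```

    With the numeric sort, ID 9 is listed before ID 10; a plain string sort would put "10" first.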

Re: Count Matches in a File
by moritz (Cardinal) on Sep 03, 2008 at 17:41 UTC
    Your first step should be to start your script with
    use strict;
    use warnings;

    And declare your variables with my.

    It would catch some of your mistakes. For example, you're populating the array @cytokine, and later you try to read from $cytokine{$ING[2]}, which accesses the (totally different) variable %cytokine.
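    A small sketch of that array-versus-hash mix-up being caught at compile time. The snippet is compiled inside a string eval purely so the demonstration keeps running after the error; the variable contents are invented.

```perl
use strict;
use warnings;

# The snippet declares the array @cytokine but then reads from the
# hash %cytokine, which was never declared.
my $snippet = q{
    use strict;
    my @cytokine = ('a', 'b', 'c');
    my $value = $cytokine{b};
    1;
};

# eval returns undef and sets $@ when the snippet fails to compile.
my $ok    = eval $snippet;
my $error = $@;
print "strict says: $error" unless $ok;
```

    Without use strict, the same line compiles silently and $value is simply undefined.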