Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello -- I am working on some coding for a bioinformatics-type project. (Though the Perl code shouldn't be terribly complicated.) I have a (large) tab-delimited file that I would like to read line by line. And for each line, I need to update information in a hash. Column 19 is the name of a gene. The rest of the columns describe a single variant within that gene. Genes can have multiple variants (thus, many lines of the file will be referring to the same gene and have the same information in column 19). I would like to read each line of the file, store the gene name (column 19) as the key in a hash, and then add 1 to the value. The end result should be a hash with ~20,000 keys, with corresponding values representing the number of variants (or number of lines in the file describing that particular gene). I am not interested in the type of variant, just the number. Here is what I have so far. The file with genes and variant information is to be entered on the command line.

#!/usr/bin/perl
use strict;
use warnings;

my $filename = $ARGV{0};
open( my $fh => $filename) || die "Cannot open $filename: $!";
my %gene_count;
while(my $line = <$fh>) {
    my @row = split("\t",$line);
    $gene_count{ $row[18] } = ++;
}
close($fh);

Here is what I am struggling with:

1) Can I write over the value of a key? ++ doesn't seem to be appropriate.
2) How do I start the value at 0 or 1? Perhaps this is not a simple enough problem for a single loop.
3) Will I be encountering issues once I get to the second time a gene is listed?

Any help/direction/advice is much appreciated! Cheers, A


Re: Looping through a file, reading each line, and adding keys/editing values of a hash
by Kenosis (Priest) on Dec 05, 2013 at 04:19 UTC

    You have a great start on your script! As for your 'struggles': 1) yes, and ++ is a perfectly appropriate and common construct; 2) The value for what? If you mean incrementing the hash value, you're doing it correctly. If you mean $row[18] to get the value of col 19, you're doing it correctly; and 3) no.

    Here are some suggested changes to consider for your script:

    use strict;
    use warnings;

    my $filename = $ARGV[0];
    my %gene_count;

    open my $fh, '<', $filename or die "Cannot open $filename: $!";

    while ( my $line = <$fh> ) {
        chomp $line;    # chomp alone would act on $_, not on $line
        my @row = split( "\t", $line );
        $gene_count{ $row[18] }++ if $row[18];
    }

    close($fh);

    print "$_ => $gene_count{$_}\n" for sort keys %gene_count;

    • $ARGV{0} -> $ARGV[0]
    • Made a few changes to your open
    • Added chomp because you're splitting on the tab character. If you don't chomp, the newline stays on the end of the array's last element (except on a final line that has no trailing newline).
    •  = ++ -> ++
    • Added if $row[18] to check for 'good' key candidate. This check could be stronger, but is likely sufficient, in this case.
    • Just fyi: The parens of split and close are optional.
    • Added printing the sorted key/value pairs. (Just assumed you wanted to do that... :)

    Since you're sending your script the filename from the command line, you can let Perl handle the file I/O. If you split on ' ' (whitespace), you don't need to chomp. You can also pass split a LIMIT, so it doesn't split every column; this can significantly speed up the splitting. Given this, the following is functionally equivalent, provided no field is empty or contains embedded spaces:

    use strict;
    use warnings;

    my %gene_count;

    while (<>) {
        my @row = split ' ', $_, 20;
        $gene_count{ $row[18] }++ if $row[18];
    }

    print "$_ => $gene_count{$_}\n" for sort keys %gene_count;
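
The effect of LIMIT, and the difference between splitting on a tab and splitting on ' ', can be seen in a small sketch. The records here are made up for illustration (the real file has at least 19 columns), so treat the field positions as assumptions:

```perl
use strict;
use warnings;

# Hypothetical short record; the gene name sits in the 4th column here.
my $line = "chr1\t12345\tA/G\tBRCA2\textra\tcolumns\n";

# Without LIMIT every field is split out.
my @all = split "\t", $line;

# With LIMIT 5 at most 5 fields come back; the 5th holds the unsplit remainder.
my @some = split "\t", $line, 5;

print scalar(@all),  " fields without LIMIT\n";
print scalar(@some), " fields with LIMIT\n";
print "gene: $some[3]\n";

# A record with an empty second column (also hypothetical).
my $gappy = "geneA\t\tmissense\n";

my @by_tab   = split "\t", $gappy;    # keeps the empty field, newline stays on the last one
my @by_space = split ' ',  $gappy;    # collapses runs of whitespace, drops the newline

print scalar(@by_tab),   " fields when splitting on tab\n";
print scalar(@by_space), " fields when splitting on ' '\n";
```

If the file can contain empty columns, splitting on ' ' shifts the later fields to the left, so the tab version with chomp is the safer choice for picking out column 19.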

    Your original script's logic is good; only minor fixes were needed. You've done well...

    Hope this helps!

      Why use a post-increment instead of a pre-increment when the value is not being used? $gene_count{ $row[18] }++ is (imo) better written ++$gene_count{$row[18]} so the increment is obvious.

      True laziness is hard work

        Interesting question.

        One reason is that the OP already attempted a post-increment, and since the two increment types would produce the same outcome, why make the change?

        Another reason is my personal preference for this counting situation. If, for example, I were on a sidewalk, tallying all the red cars that passed by, I wouldn't make a tally mark upon their approach (pre-increment), but rather after they crossed an imaginary line extending across the street from my position (post-increment).

        Perhaps this ultimately boils down to personal preference, in cases like these...
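
The claim that both increment forms give the same outcome when the value is not used can be checked with a minimal sketch. The hash names and the gene name are made up for illustration:

```perl
use strict;
use warnings;

my ( %post, %pre );

# As standalone statements, both forms leave the same count behind.
$post{TP53}++ for 1 .. 3;
++$pre{TP53}  for 1 .. 3;
print "post: $post{TP53}, pre: $pre{TP53}\n";

# The forms differ only when the value of the expression is used.
my $n         = 5;
my $from_post = $n++;    # yields the old value; $n becomes 6
my $from_pre  = ++$n;    # yields the new value; $n becomes 7
print "from_post=$from_post from_pre=$from_pre n=$n\n";
```

For a counter that is only ever incremented as a statement, as in the gene-counting loop, the choice is a matter of style.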

        There are different schools of thought there. I strongly favour the postincrement.

      So, so helpful! Thank you. (And glad to see that I was on the right track and didn't need any major re-organizing). Thanks again! Cheers, Amelia

        You're most welcome, Amelia!

Re: Looping through a file, reading each line, and adding keys/editing values of a hash
by aaron_baugher (Curate) on Dec 05, 2013 at 15:44 UTC

    You've already gotten the fix, but the important thing to note here is that you don't have to "start" a hash value with anything. You can just increment it, and it will be created and set to 1 if it didn't already exist. That saves you from having to write code like this, which might be necessary in lower-level languages:

    if( $gene_count{$key} ){     # it already exists
        $gene_count{$key}++;     # so increment it
    } else {                     # it doesn't exist
        $gene_count{$key} = 1;   # so create and set it
    }
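
As a small sketch of that behaviour, the loop below increments keys that have never been set, and the counts come out right without any warnings. The gene names are placeholders standing in for column 19 of the real file:

```perl
use strict;
use warnings;

my %gene_count;

# Hypothetical gene names, one per "line" of input.
for my $gene (qw(BRCA1 TP53 BRCA1 BRCA1)) {
    $gene_count{$gene}++;    # creates the key on first sight, then counts up
}

print "$_ => $gene_count{$_}\n" for sort keys %gene_count;

# exists() tells "never seen" apart from "seen, but with a false value".
print "EGFR seen? ", ( exists( $gene_count{EGFR} ) ? "yes" : "no" ), "\n";
```

Note that checking a key with exists() does not create it, so the hash still contains only the genes that were actually counted.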

    Aaron B.
    Available for small or large Perl jobs and *nix system administration; see my home node.

      Hi, if I want to store the count in the file itself as a key-value pair, how do I do it? Thanks.