Looping through a file, reading each line, and adding keys/editing values of a hash

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello -- I am working on some coding for a bioinformatics-type project. (Though the Perl code shouldn't be terribly complicated.) I have a (large) tab-delimited file that I would like to read line by line. For each line, I need to update information in a hash. Column 19 is the name of a gene. The rest of the columns describe a single variant within that gene. Genes can have multiple variants (thus, many lines of the file will refer to the same gene and have the same information in column 19). I would like to read each line of the file, store the gene name (column 19) as the key in a hash, and then add 1 to the value. The end result should be a hash with ~20,000 keys, with corresponding values representing the number of variants (or number of lines in the file describing that particular gene). I am not interested in the type of variant, just the number. Here is what I have so far. The file with genes and variant information is to be entered on the command line.

#!/usr/bin/perl
use strict;
use warnings;

my $filename = $ARGV{0};

open( my $fh => $filename) || die "Cannot open $filename: $!";
my %gene_count;

while(my $line = <$fh>) {
        my @row = split("\t",$line);
        $gene_count{ $row[18] } = ++;
}
close($fh);

Here is what I am struggling with: 1) Can I write over the value of a key? ++ doesn't seem to be appropriate. 2) How do I start the value at 0 or 1? Perhaps this is not a simple enough problem for a single loop. 3) Will I be encountering issues once I get to the second time a gene is listed? Any help/direction/advice is much appreciated! Cheers, A

Re: Looping through a file, reading each line, and adding keys/editing values of a hash by Kenosis (Priest) on Dec 05, 2013 at 04:19 UTC
You have a great start on your script! As for your 'struggles': 1) yes, and `++` is a perfectly appropriate and common construct; 2) The value for what? If you mean incrementing the hash value, you're doing it correctly. If you mean `$row[18]` to get the value of col 19, you're doing it correctly; and 3) no.

Here are some suggested changes to consider for your script:

use strict;
use warnings;

my $filename = $ARGV[0];
my %gene_count;

open my $fh, '<', $filename or die "Cannot open $filename: $!";

while ( my $line = <$fh> ) {
        chomp $line;
        my @row = split( "\t", $line );
        $gene_count{ $row[18] }++ if $row[18];
}
close($fh);

print "$_ => $gene_count{$_}\n" for sort keys %gene_count;

- `$ARGV{0}` -> `$ARGV[0]`
- Made a few changes to your `open`
- Added `chomp` because you're `split`ting on the tab character. If you don't `chomp`, a newline will be on the end of the array's last element (except on the file's last line, if it lacks a trailing newline).
- `= ++` -> `++`
- Added `if $row[18]` to check for a 'good' key candidate. This check could be stronger, but is likely sufficient in this case.
- Just fyi: The parens of `split` and `close` are optional.
- Added printing the sorted key/value pairs. (Just assumed you wanted to do that... :)

Since you're sending your script the filename from the command line, you can let Perl handle the file I/O by reading with `<>`. If you `split` on ' ' (whitespace) you don't need to `chomp`. Also, you can send `split` a LIMIT, so it's not `split`ting all columns. Using this LIMIT can significantly speed up the `split`ting process. Given this, the following is functionally equivalent, provided no column is empty or contains spaces:

use strict;
use warnings;

my %gene_count;

while (<>) {
        my @row = split ' ', $_, 20;
        $gene_count{ $row[18] }++ if $row[18];
}

print "$_ => $gene_count{$_}\n" for sort keys %gene_count;

Your original script's logic is good; only minor fixes were needed. You've done well... Hope this helps!
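As a rough illustration of the `split` LIMIT idea in this reply, here is a minimal sketch that counts column 19 from made-up in-memory data. The gene names and the `c1`..`c21` column values are hypothetical, and the sketch splits on tabs (with `chomp`) rather than on whitespace, on the assumption that some columns might be empty and should keep their positions.

```perl
use strict;
use warnings;

# Hypothetical sample data: three tab-delimited lines of 21 columns each,
# with a made-up gene name in column 19 (index 18).
my @genes = qw(BRCA1 TP53 BRCA1);
my $data  = join '', map {
    my $gene = $_;
    join( "\t", ( map { "c$_" } 1 .. 18 ), $gene, 'c20', 'c21' ) . "\n";
} @genes;

# Read from an in-memory filehandle so the sketch needs no input file.
open my $fh, '<', \$data or die "Cannot open in-memory data: $!";

my %gene_count;
while ( my $line = <$fh> ) {
    chomp $line;

    # LIMIT of 20: indexes 0..18 are split out; everything after
    # column 19 is left unsplit in the final element.
    my @row = split /\t/, $line, 20;

    $gene_count{ $row[18] }++ if defined $row[18] && length $row[18];
}
close $fh;

print "$_ => $gene_count{$_}\n" for sort keys %gene_count;
```

With a real file, the in-memory `open` would be replaced by opening the filename from `$ARGV[0]`, as in the script above.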
Re^2: Looping through a file, reading each line, and adding keys/editing values of a hash by GrandFather (Saint) on Dec 05, 2013 at 08:12 UTC
Why use a post-increment instead of a pre-increment when the value is not being used? `$gene_count{ $row[18] }++` is (imo) better written `++$gene_count{$row[18]}` so the increment is obvious.
True laziness is hard work
Re^3: Looping through a file, reading each line, and adding keys/editing values of a hash by Kenosis (Priest) on Dec 06, 2013 at 16:48 UTC
Interesting question. One reason is that the OP already attempted a post-increment, and since the two increment types would produce the same outcome, why make the change? Another reason is my personal preference for this counting situation. If, for example, I were on a sidewalk, tallying all the red cars that passed by, I wouldn't make a tally mark upon their approach (pre-increment), but rather after they crossed an imaginary line extending across the street from my position (post-increment). Perhaps this ultimately boils down to personal preference, in cases like these...
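To make the "same outcome" point concrete, here is a small sketch (the keys `a` and `b` are placeholders, not from the thread). When the expression's value is discarded, both forms leave the same count behind; they differ only when the value is actually used.

```perl
use strict;
use warnings;

my %count;

# In void context the two forms have the same effect on the hash.
$count{a}++;
++$count{b};

# The difference only shows when the expression's value is used.
my $post = $count{a}++;    # $post gets the old value; $count{a} is then incremented
my $pre  = ++$count{b};    # $count{b} is incremented first; $pre gets the new value

print "post=$post pre=$pre a=$count{a} b=$count{b}\n";
```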
Re^3: Looping through a file, reading each line, and adding keys/editing values of a hash by Anonymous Monk on Dec 05, 2013 at 20:16 UTC
There are different schools of thought there. I strongly favour the postincrement.
Re^4: Looping through a file, reading each line, and adding keys/editing values of a hash by GrandFather (Saint) on Dec 05, 2013 at 20:47 UTC
Re^5: Looping through a file, reading each line, and adding keys/editing values of a hash by choroba (Cardinal) on Dec 06, 2013 at 17:10 UTC
Re^2: Looping through a file, reading each line, and adding keys/editing values of a hash by Anonymous Monk on Dec 05, 2013 at 05:58 UTC
So, so helpful! Thank you. (And glad to see that I was on the right track and didn't need any major re-organizing). Thanks again! Cheers, Amelia
Re^3: Looping through a file, reading each line, and adding keys/editing values of a hash by Kenosis (Priest) on Dec 05, 2013 at 06:01 UTC
You're most welcome, Amelia!
Re: Looping through a file, reading each line, and adding keys/editing values of a hash by aaron_baugher (Curate) on Dec 05, 2013 at 15:44 UTC
You've already gotten the fix, but the important thing to note here is that you don't have to "start" a hash value with anything. You can just increment it, and it will be created and set to 1 if it didn't already exist. That saves you from having to write code like this, which might be necessary in lower-level languages:

if( $gene_count{$key} ){        # it already exists
        $gene_count{$key}++;    # so increment it
} else {                        # it doesn't exist
        $gene_count{$key} = 1;  # so create and set it
}

Aaron B.
Available for small or large Perl jobs and *nix system administration; see my home node.
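A quick way to see this for yourself is the sketch below, which increments a key that was never initialized. The key name `GENE_A` is purely hypothetical.

```perl
use strict;
use warnings;

my %gene_count;
my $key = 'GENE_A';    # hypothetical key, for illustration only

# Before the first increment the key is not in the hash at all.
my $existed_before = exists $gene_count{$key} ? 'yes' : 'no';

$gene_count{$key}++;   # key is created and set to 1; no "uninitialized" warning
$gene_count{$key}++;   # second occurrence simply bumps the count

print "existed before: $existed_before\n";
print "$key => $gene_count{$key}\n";
```

The increment operators are exempt from the "uninitialized value" warning, which is why this idiom is safe under `use warnings`.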
Re^2: Looping through a file, reading each line, and adding keys/editing values of a hash by Anonymous Monk on Nov 10, 2015 at 06:13 UTC
Hi, if I want to store the count in the file itself as a key-value pair, how do I do it? Thanks.