bontus has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

My question concerns the automatic creation of a hash of hashes for counting purposes. Let's start off with the basics:

I have a number of chromosome names my @chromosomes = ("chr1","chr2","chr3"); each of which is a separate entity than can be regarded as an array of locations such as @locations = (0..$chrLength); $chrLength is dependent on which chromosome we are looking at, but let's assume it to be 1000 for now.

Next, I have a number of short sequences (reads) for which I know the location within one specific chromosome, e.g.: read A, B and C come from chr1:100, whereas read D is the only read from chr1:200. Thus, chr1:100 has 3 counts and chr1:200 has 1 count, whilst all other locations have 0 counts. This data is stored in a file, with each read being a separate line. My task is to count how many reads are coming from a certain location.

So far, I think I have a working solution, which looks somewhat like this:

if $locationOfRead == $location $chrHash->{$chr}->{$location}++
but since I am dealing with a huge number of locations and I am only interested in non-zero counts, I want to avoid looping through the hash for each chromosome searching for counts of >= 1.

What I have in mind is to create a hash that contains the chromosome names as keys (%chrHash as above), but for each chromosome use a hash of discrete counts that in turn contains the locations. So if I find that read A comes from chr1:100, then $chrHash{"chr1"}{1} should be assigned 100. Since read B and C are also coming from the same location, $chrHash{"chr1"}{2} and $chrHash{"chr1"}{3} should subsequently be assigned location 100. Location 100 is thus in a way moving up the ladder, so I also need to erase it from the previous counts.

Furthermore, if we assume another location (chr1:300) also has three reads, both 100 and 300 should then be listed in $chrHash{"chr1"}{3}.

Hence, I would need a construction in the way of $chrHash->{$chr}->{$counts}->{@locations}.

The problem for me is: how can I assign an empty array to each of the discrete counts, which can then be used to store all locations once they have been identified?

I hope this explanation is somewhat clear. Thank you for your time and best regards,

bontus

PS: my code so far

use POSIX qw(ceil floor); use strict; my $binSize = $ARGV[0]; # size of each bin (default 1000) my $readSize = $ARGV[1]; my @chromosomes = ("chr1","chr2","chr3"); while(<STDIN>) {} # parse through file once to get number of lines my $maxCount = $.; # max count => all reads from the same locus => num +ber of lines in SAM file my @counts = (0..$maxCount); my @countHash; map { $countHash{$_} = "" } @counts; # ??? my %chrHash; map { $chrHash{$_} = { %countHash } } @chromosomes; # old version: my @bins = (0..(250000000/$binSize); # chr1 ~ 250 mio bases my @binHash; my %chrHash; map { $binHash{$_} = 0 } @bins; # assume that all chromosomes are equal size and create a hash of hash +es for counting map { $chrHash{$_} = { %binHash } } @chromosomes; # Read in SAM file and count reads per bin while(<STDIN>) { chomp(); my @line = split(/\t/,$_); if ($line[2] ~~ @chromosomes) { my $location = floor(ceil((2*$line[3]+$readSize)/2)/1000); $chrHash{$line[2]}{$location}++; } }

Replies are listed 'Best First'.
Re: Automatic creation of a hash of hashes of arrays (?)
by Kenosis (Priest) on Jan 27, 2014 at 18:21 UTC

    Perhaps the following data structure--a hash of hashes of arrays (HoHoA)--would be helpful (as suggested in your posting's topic):

    use strict; use warnings; use Data::Dumper; my %chrHash; push @{ $chrHash{"chr1"}{3} }, 100; push @{ $chrHash{"chr1"}{3} }, 300; print 'Count for chr1 3: ', scalar @{ $chrHash{"chr1"}{3} }, "\n"; print qq/Locations for chr1 3: @{ $chrHash{"chr1"}{3} }\n\n/; print Dumper \%chrHash;

    Output:

    Count for chr1 3: 2 Locations for chr1 3: 100 300 $VAR1 = { 'chr1' => { '3' => [ 100, 300 ] } };

    The arrays would track both count and locations.

    On another issue, consider:

    $chrHash{$_} = { %countHash } for @chromosomes;

    instead of using map in a void context, like:

    map { $chrHash{$_} = { %countHash } } @chromosomes;

    Also, consider using grep or List::Util's first instead of the smart-match operator (see Smart matching is experimental/depreciated in 5.18 - recommendations?):

    use strict; use warnings; use List::Util qw/first/; my @chromosomes = ( "chr1", "chr2", "chr3", "chr5" ); my @line = ( "chr1", "chr3", "chr5" ); if ( grep $line[2] eq $_, @chromosomes ) { print "Found $line[2] using grep!\n"; } if ( first { $line[2] eq $_ } @chromosomes ) { print "Found $line[2] using first!\n"; }

    Output:

    Found chr5 using grep! Found chr5 using first!

    The advantage of first is that it terminates upon the first successful find, whereas grep will iterate through the entire list.

Re: Automatic creation of a hash of hashes of arrays (?)
by kcott (Archbishop) on Jan 27, 2014 at 22:58 UTC

    G'day bontus,

    Welcome to the monastery.

    "What I have in mind is to create a hash that contains the chromosome names as keys (%chrHash as above), but for each chromosome use a hash of discrete counts that in turn contains the locations. So if I find that read A comes from chr1:100, then $chrHash{"chr1"}{1} should be assigned 100. Since read B and C are also coming from the same location, $chrHash{"chr1"}{2} and $chrHash{"chr1"}{3} should subsequently be assigned location 100. Location 100 is thus in a way moving up the ladder, so I also need to erase it from the previous counts."

    I wouldn't do it that way: all the additional coding and processing to implement "erase it from the previous counts" is unnecessary. Instead, I would use a structure where the values of each $chrHash{$chr}{$location} are the counts.

    "The problem for me is: how can I assign an empty array to each of the discrete counts, which can then be used to store all locations once they have been identified?"

    Again, this would be unnecessary and could lead to a hash that uses large amounts of memory and takes a long time to process.

    "My task is to count how many reads are coming from a certain location."
    "I am dealing with a huge number of locations and I am only interested in non-zero counts"

    Those two statements are ambiguous; possibly contradictory. In the code below, get_loc_reads() will return a count from a specific location; get_read_counts() will output all the locations with a positive count. You may need to pick and choose code from those two to get exactly what you want: hopefully, the techniques to access the data are obvious.

    In this script I've had to guess at your input data format. I've also provided an option (currently commented out) to store the sequence identifers — more on this after the code.

    #!/usr/bin/env perl -l use strict; use warnings; my %count; while (<DATA>) { my ($seq, $chr, $loc) = split; #push @{$count{$chr}{$loc}}, $seq; ++$count{$chr}{$loc}; } print 'Reads by specific chromosome and location:'; my @wanted = ([qw{chr1 100}], [qw{chr1 500}], [qw{chr2 200}], [qw{chr3 + 100}]); for (@wanted) { print "Reads on CHR:$_->[0] LOC:$_->[1] = ", get_loc_reads(@$_); } print 'Reads with positive counts by chromosome and location:'; get_read_counts(); sub get_loc_reads { my ($chr, $loc) = @_; #return exists $count{$chr}{$loc} ? scalar @{$count{$chr}{$loc}} : + 0; return exists $count{$chr}{$loc} ? $count{$chr}{$loc} : 0; } sub get_read_counts { for my $chr (sort keys %count) { for my $loc (sort keys %{$count{$chr}}) { #print "Reads on CHR:$chr LOC:$loc = ", scalar @{$count{$c +hr}{$loc}}; print "Reads on CHR:$chr LOC:$loc = ", $count{$chr}{$loc}; } } } __DATA__ A chr1 100 B chr1 100 C chr1 100 D chr1 200 E chr1 300 F chr1 300 G chr1 300 H chr2 100 I chr2 200 J chr2 200

    Output:

    Reads by specific chromosome and location: Reads on CHR:chr1 LOC:100 = 3 Reads on CHR:chr1 LOC:500 = 0 Reads on CHR:chr2 LOC:200 = 2 Reads on CHR:chr3 LOC:100 = 0 Reads with positive counts by chromosome and location: Reads on CHR:chr1 LOC:100 = 3 Reads on CHR:chr1 LOC:200 = 1 Reads on CHR:chr1 LOC:300 = 3 Reads on CHR:chr2 LOC:100 = 1 Reads on CHR:chr2 LOC:200 = 2

    As I alluded to above, there's also an option to store the sequence identifiers. I didn't know if you wanted this but it was easy to code and shows another method of getting a count that you may find useful now or in the future.

    Three lines are commented out: uncomment these and comment out the line that follows each of them. Here's the changes:

    push @{$count{$chr}{$loc}}, $seq; #++$count{$chr}{$loc}; ... return exists $count{$chr}{$loc} ? scalar @{$count{$chr}{$loc}} : +0; #return exists $count{$chr}{$loc} ? $count{$chr}{$loc} : 0; ... print "Reads on CHR:$chr LOC:$loc = ", scalar @{$count{$ch +r}{$loc}}; #print "Reads on CHR:$chr LOC:$loc = ", $count{$chr}{$loc} +;

    The output is identical but you now have access to the sequences, via @{$count{$chr}{$loc}}, should you need them.

    -- Ken

      Thank you both for your suggestions. I followed Ken's advice and solved my task, so thank you again and have a nice day :)