bontus has asked for the wisdom of the Perl Monks concerning the following question:
My question concerns the automatic creation of a hash of hashes for counting purposes. Let's start off with the basics:
I have a number of chromosome names, my @chromosomes = ("chr1","chr2","chr3"); each of these is a separate entity that can be regarded as an array of locations, such as @locations = (0..$chrLength); where $chrLength depends on which chromosome we are looking at, but let's assume it to be 1000 for now.
Next, I have a number of short sequences (reads) for which I know the location within one specific chromosome, e.g.: read A, B and C come from chr1:100, whereas read D is the only read from chr1:200. Thus, chr1:100 has 3 counts and chr1:200 has 1 count, whilst all other locations have 0 counts. This data is stored in a file, with each read being a separate line. My task is to count how many reads are coming from a certain location.
So far, I think I have a working solution, which looks somewhat like this:

    $chrHash->{$chr}->{$location}++ if $locationOfRead == $location;

but since I am dealing with a huge number of locations and am only interested in non-zero counts, I want to avoid looping through the hash for each chromosome searching for counts >= 1.
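A minimal sketch of counting only the locations that actually occur, assuming the reads have already been parsed into chromosome/location pairs (the @reads list below is made-up sample data, not taken from the SAM file). Because Perl autovivifies nested hash entries on first use, nothing has to be pre-filled with zeros, so every key that exists has a count of at least 1:

```perl
use strict;
use warnings;

# Hypothetical sample data: [chromosome, location] pairs standing in for parsed reads.
my @reads = (
    [ 'chr1', 100 ], [ 'chr1', 100 ], [ 'chr1', 100 ],
    [ 'chr1', 200 ],
);

my %chrHash;
for my $read (@reads) {
    my ($chr, $location) = @$read;
    # Autovivification creates $chrHash{$chr} and the location key on first use.
    $chrHash{$chr}{$location}++;
}

# Only locations with at least one read exist as keys, so no filtering is needed.
for my $chr (sort keys %chrHash) {
    for my $location (sort { $a <=> $b } keys %{ $chrHash{$chr} }) {
        print "$chr:$location\t$chrHash{$chr}{$location}\n";
    }
}
```

With the sample data this lists chr1:100 with 3 reads and chr1:200 with 1 read; locations without reads never appear in the hash.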
What I have in mind is to create a hash that contains the chromosome names as keys (%chrHash as above), but for each chromosome use a hash of discrete counts that in turn contains the locations. So if I find that read A comes from chr1:100, then $chrHash{"chr1"}{1} should be assigned 100. Since reads B and C also come from the same location, $chrHash{"chr1"}{2} and then $chrHash{"chr1"}{3} should subsequently be assigned location 100. Location 100 is thus, in a way, moving up the ladder, so I also need to erase it from the previous counts.
Furthermore, if we assume another location (chr1:300) also has three reads, both 100 and 300 should then be listed in $chrHash{"chr1"}{3}.
Hence, I would need a construct along the lines of $chrHash->{$chr}->{$counts}->{@locations}.
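One possible way to sketch the "moving up the ladder" bookkeeping is shown below. It is only an illustration under some assumptions: the names %countAt and %byCount and the sample @reads are hypothetical, and the locations under each count are kept as hash keys (a set) rather than in an array, purely so that a location can be removed from its previous count cheaply.

```perl
use strict;
use warnings;

# Hypothetical sample data: [chromosome, location] pairs in arrival order.
my @reads = (
    [ 'chr1', 100 ], [ 'chr1', 300 ], [ 'chr1', 100 ], [ 'chr1', 200 ],
    [ 'chr1', 300 ], [ 'chr1', 100 ], [ 'chr1', 300 ],
);

my %countAt;    # chromosome => location => current count
my %byCount;    # chromosome => count => { location => 1 } (set of locations)

for my $read (@reads) {
    my ($chr, $loc) = @$read;
    my $old = $countAt{$chr}{$loc} // 0;
    my $new = ++$countAt{$chr}{$loc};

    if ($old) {
        # Erase the location from the rung it is leaving ...
        delete $byCount{$chr}{$old}{$loc};
        # ... and drop that rung entirely once it is empty.
        delete $byCount{$chr}{$old} unless %{ $byCount{$chr}{$old} };
    }
    $byCount{$chr}{$new}{$loc} = 1;
}

for my $count (sort { $a <=> $b } keys %{ $byCount{chr1} }) {
    my @locations = sort { $a <=> $b } keys %{ $byCount{chr1}{$count} };
    print "chr1 count $count: @locations\n";
}
```

Here locations 100 and 300 both end up under count 3 and location 200 under count 1. The extra %countAt lookup is what tells the loop which rung a location currently sits on, so nothing has to be searched.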
The problem for me is: how can I assign an empty array to each of the discrete counts, which can then be used to store all locations once they have been identified?
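As far as the empty arrays go, a sketch like the following suggests they may not need to be created up front at all: push on a dereferenced, not-yet-existing hash slot autovivifies the array reference. The example assumes the per-location counts are already finished (the %chrHash contents here are made-up sample values) and simply inverts them into count => list of locations; %byCount is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical finished counts: chromosome => location => number of reads.
my %chrHash = ( chr1 => { 100 => 3, 200 => 1, 300 => 3 } );

my %byCount;    # chromosome => count => [ locations ]
for my $chr (keys %chrHash) {
    while (my ($location, $count) = each %{ $chrHash{$chr} }) {
        # No empty array has to be assigned first: push autovivifies
        # the array reference the first time a given count is seen.
        push @{ $byCount{$chr}{$count} }, $location;
    }
}

for my $count (sort { $a <=> $b } keys %{ $byCount{chr1} }) {
    my @locations = sort { $a <=> $b } @{ $byCount{chr1}{$count} };
    print "chr1 count $count: @locations\n";
}
```

Inverting once at the end avoids having to move locations between counts while reading the file, at the cost of a second pass over the non-zero locations.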
I hope this explanation is somewhat clear. Thank you for your time and best regards,
bontus

PS: my code so far
    use strict;
    use POSIX qw(ceil floor);

    my $binSize  = $ARGV[0];    # size of each bin (default 1000)
    my $readSize = $ARGV[1];
    my @chromosomes = ("chr1", "chr2", "chr3");

    while (<STDIN>) {}          # parse through file once to get number of lines
    my $maxCount = $.;          # max count => all reads from the same locus => number of lines in SAM file
    my @counts = (0 .. $maxCount);
    my %countHash;
    map { $countHash{$_} = "" } @counts;    # ???
    my %chrHash;
    map { $chrHash{$_} = { %countHash } } @chromosomes;

    # old version:
    my @bins = (0 .. (250000000 / $binSize));    # chr1 ~ 250 mio bases
    my %binHash;
    my %chrHash;
    map { $binHash{$_} = 0 } @bins;
    # assume that all chromosomes are equal size and create a hash of hashes for counting
    map { $chrHash{$_} = { %binHash } } @chromosomes;

    # Read in SAM file and count reads per bin
    while (<STDIN>) {
        chomp();
        my @line = split(/\t/, $_);
        if ($line[2] ~~ @chromosomes) {
            my $location = floor(ceil((2 * $line[3] + $readSize) / 2) / 1000);
            $chrHash{$line[2]}{$location}++;
        }
    }
Replies are listed 'Best First'.
Re: Automatic creation of a hash of hashes of arrays (?)
by Kenosis (Priest) on Jan 27, 2014 at 18:21 UTC

Re: Automatic creation of a hash of hashes of arrays (?)
by kcott (Archbishop) on Jan 27, 2014 at 22:58 UTC
by bontus (Novice) on Jan 30, 2014 at 12:55 UTC