Hello Monks,

My question concerns the automatic creation of a hash of hashes for counting purposes. Let's start off with the basics:

I have a number of chromosome names my @chromosomes = ("chr1","chr2","chr3"); each of which is a separate entity that can be regarded as an array of locations such as @locations = (0..$chrLength); $chrLength depends on which chromosome we are looking at, but let's assume it to be 1000 for now.

Next, I have a number of short sequences (reads) for which I know the location within one specific chromosome, e.g.: read A, B and C come from chr1:100, whereas read D is the only read from chr1:200. Thus, chr1:100 has 3 counts and chr1:200 has 1 count, whilst all other locations have 0 counts. This data is stored in a file, with each read being a separate line. My task is to count how many reads are coming from a certain location.

So far, I think I have a working solution, which looks somewhat like this:

if ($locationOfRead == $location) {
    $chrHash->{$chr}->{$location}++;
}

but since I am dealing with a huge number of locations and I am only interested in non-zero counts, I want to avoid looping through the hash for each chromosome searching for counts of >= 1.
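A minimal sketch of counting without pre-filling the hash, assuming the reads are already available as [chromosome, location] pairs (the @reads list is made up for illustration). Because Perl creates nested hash entries on first use, only locations that received at least one read ever become keys, so there are no zero counts to skip:

```perl
use strict;
use warnings;

# Hypothetical input: one [chromosome, location] pair per read.
my @reads = (
    [ 'chr1', 100 ], [ 'chr1', 100 ], [ 'chr1', 100 ],
    [ 'chr1', 200 ],
);

my %chrHash;
for my $read (@reads) {
    my ($chr, $location) = @$read;
    # The inner hash and its key are created on first use.
    $chrHash{$chr}{$location}++;
}

# Only locations with at least one read exist as keys.
for my $chr (sort keys %chrHash) {
    for my $location (sort { $a <=> $b } keys %{ $chrHash{$chr} }) {
        print "$chr:$location\t$chrHash{$chr}{$location}\n";
    }
}
```

With the four reads above, this prints chr1:100 with a count of 3 and chr1:200 with a count of 1; no other location is stored.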

What I have in mind is to create a hash that contains the chromosome names as keys (%chrHash as above), but for each chromosome use a hash of discrete counts that in turn contains the locations. So if I find that read A comes from chr1:100, then $chrHash{"chr1"}{1} should be assigned 100. Since read B and C are also coming from the same location, $chrHash{"chr1"}{2} and $chrHash{"chr1"}{3} should subsequently be assigned location 100. Location 100 is thus in a way moving up the ladder, so I also need to erase it from the previous counts.

Furthermore, if we assume another location (chr1:300) also has three reads, both 100 and 300 should then be listed in $chrHash{"chr1"}{3}.
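The "moving up the ladder" idea could be sketched as follows. This is only one possible layout, not a definitive implementation: the names %count, %byCount and add_read are hypothetical, and each count holds a hash used as a set (rather than an array) so that removing a location from its previous count is a single delete:

```perl
use strict;
use warnings;

my %count;    # $count{$chr}{$location}       = reads seen so far
my %byCount;  # $byCount{$chr}{$n}{$location} = 1 if the location currently has $n reads

# Hypothetical helper: register one read and move its location up one rung.
sub add_read {
    my ($chr, $location) = @_;
    my $old = $count{$chr}{$location} // 0;
    my $new = ++$count{$chr}{$location};
    if ($old) {
        delete $byCount{$chr}{$old}{$location};
        # Drop the rung entirely once it is empty.
        delete $byCount{$chr}{$old} unless %{ $byCount{$chr}{$old} };
    }
    $byCount{$chr}{$new}{$location} = 1;
}

add_read('chr1', $_) for 100, 100, 100, 300, 300, 300, 200;

for my $n (sort { $a <=> $b } keys %{ $byCount{chr1} }) {
    my @locations = sort { $a <=> $b } keys %{ $byCount{chr1}{$n} };
    print "$n: @locations\n";
}
```

After these seven reads, count 3 lists locations 100 and 300, count 1 lists location 200, and count 2 no longer exists.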

Hence, I would need a construction along the lines of $chrHash->{$chr}->{$count}, where each count holds an array of @locations.

The problem for me is: how can I assign an empty array to each of the discrete counts, which can then be used to store all locations once they have been identified?
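For what it is worth, Perl does not require the empty arrays to exist beforehand: push on a dereferenced hash slot creates the array reference on first use. A minimal sketch, assuming the per-location counts have already been collected in a first pass (the sample data and the name %byCount are made up for illustration). Note that this inverts the finished counts in a second step instead of moving locations during counting:

```perl
use strict;
use warnings;

# Hypothetical per-location counts, as produced by a first counting pass.
my %chrHash = (
    chr1 => { 100 => 3, 200 => 1, 300 => 3 },
);

my %byCount;  # $byCount{$chr}{$count} = [ locations with that count ]
for my $chr (keys %chrHash) {
    while (my ($location, $count) = each %{ $chrHash{$chr} }) {
        # push creates the array reference on first use.
        push @{ $byCount{$chr}{$count} }, $location;
    }
}

for my $count (sort { $a <=> $b } keys %{ $byCount{chr1} }) {
    my @locations = sort { $a <=> $b } @{ $byCount{chr1}{$count} };
    print "$count: @locations\n";
}
```

Only counts that actually occur become keys, so there is no need to know the maximum count in advance.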

I hope this explanation is somewhat clear. Thank you for your time and best regards,

bontus

PS: my code so far

 
use POSIX qw(ceil floor);
use strict; 

my $binSize = $ARGV[0]; # size of each bin (default 1000)
my $readSize = $ARGV[1];
my @chromosomes = ("chr1","chr2","chr3");
while(<STDIN>) {} # parse through file once to get number of lines
my $maxCount = $.; # max count => all reads from the same locus => number of lines in SAM file
my @counts = (0..$maxCount);
my %countHash;
map { $countHash{$_} = "" } @counts; # ???
my %chrHash;
map { $chrHash{$_} = { %countHash } } @chromosomes;

# old version:

my @bins = (0..(250000000/$binSize)); # chr1 ~ 250 million bases
my %binHash;
my %chrHash;
map { $binHash{$_} = 0 } @bins;
# assume that all chromosomes are equal size and create a hash of hashes for counting
map { $chrHash{$_} = { %binHash } } @chromosomes;
# Read in SAM file and count reads per bin
while(<STDIN>) {
    chomp();
    my @line = split(/\t/,$_);
    if (grep { $_ eq $line[2] } @chromosomes) { # chromosome is one we track
        my $location = floor(ceil((2*$line[3]+$readSize)/2)/$binSize); # bin of the read's midpoint
        $chrHash{$line[2]}{$location}++;
    }
}

In reply to Automatic creation of a hash of hashes of arrays (?) by bontus
