Hello Monks,

My question concerns the automatic creation of a hash of hashes for counting purposes. Let's start off with the basics:

I have a number of chromosome names, my @chromosomes = ("chr1","chr2","chr3");, each of which is a separate entity that can be regarded as an array of locations such as @locations = (0..$chrLength);. $chrLength depends on which chromosome we are looking at, but let's assume it to be 1000 for now.

Next, I have a number of short sequences (reads) for which I know the location within one specific chromosome, e.g.: read A, B and C come from chr1:100, whereas read D is the only read from chr1:200. Thus, chr1:100 has 3 counts and chr1:200 has 1 count, whilst all other locations have 0 counts. This data is stored in a file, with each read being a separate line. My task is to count how many reads are coming from a certain location.

So far, I think I have a working solution, which looks somewhat like this:

$chrHash->{$chr}->{$location}++ if $locationOfRead == $location;
but since I am dealing with a huge number of locations and I am only interested in non-zero counts, I want to avoid looping through the hash for each chromosome searching for counts of >= 1.
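For reference, a minimal sketch of counting without pre-filling every location, assuming the reads are already parsed into chromosome/location pairs (the @reads list below is made-up sample data, not my real input). Since Perl creates nested hash entries on first use, only locations that actually received a read end up as keys:

```perl
use strict;
use warnings;

# Hypothetical sample reads as [chromosome, location] pairs (made up for illustration).
my @reads = ( ['chr1', 100], ['chr1', 100], ['chr1', 100], ['chr1', 200] );

my %chrHash;
for my $read (@reads) {
    my ($chr, $location) = @$read;
    # Both hash levels are autovivified on first use, so no zero entries are stored.
    $chrHash{$chr}{$location}++;
}

# Only locations with at least one read exist as keys.
for my $chr (sort keys %chrHash) {
    for my $location (sort { $a <=> $b } keys %{ $chrHash{$chr} }) {
        print "$chr:$location\t$chrHash{$chr}{$location}\n";
    }
}
```

With a hash built this way, looping over the keys only ever visits non-zero counts, so there would be nothing to filter out afterwards.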

What I have in mind is to create a hash that contains the chromosome names as keys (%chrHash as above), but for each chromosome use a hash of discrete counts that in turn contains the locations. So if I find that read A comes from chr1:100, then $chrHash{"chr1"}{1} should be assigned 100. Since read B and C are also coming from the same location, $chrHash{"chr1"}{2} and $chrHash{"chr1"}{3} should subsequently be assigned location 100. Location 100 is thus in a way moving up the ladder, so I also need to erase it from the previous counts.

Furthermore, if we assume another location (chr1:300) also has three reads, both 100 and 300 should then be listed in $chrHash{"chr1"}{3}.

Hence, I would need a construction along the lines of $chrHash->{$chr}->{$count} = [@locations], i.e. a hash of hashes of arrays.

The problem for me is: how can I assign an empty array to each of the discrete counts, which can then be used to store all locations once they have been identified?
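To illustrate the structure I am after, here is a rough sketch under the assumption that the per-location counts are finished first and only then grouped by count. This is one possible approach, not necessarily the best one; %byCount is a name I made up, and the counts in %chrHash are sample values. Pushing onto @{ $byCount{$chr}{$count} } lets Perl create the empty array on first use, so it may not need to be assigned in advance:

```perl
use strict;
use warnings;

# Hypothetical per-location counts, as a counting pass might produce them (sample values).
my %chrHash = ( chr1 => { 100 => 3, 200 => 1, 300 => 3 } );

# Invert location => count into count => [locations].
my %byCount;
for my $chr (keys %chrHash) {
    for my $location (keys %{ $chrHash{$chr} }) {
        my $count = $chrHash{$chr}{$location};
        # The array reference is autovivified by push; no pre-assigned empty array needed.
        push @{ $byCount{$chr}{$count} }, $location;
    }
}

for my $chr (sort keys %byCount) {
    for my $count (sort { $a <=> $b } keys %{ $byCount{$chr} }) {
        my @locations = sort { $a <=> $b } @{ $byCount{$chr}{$count} };
        print "$chr count=$count: @locations\n";
    }
}
```

Grouping once at the end would also avoid having to erase a location from the lower counts each time it moves up.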

I hope this explanation is somewhat clear. Thank you for your time and best regards,

bontus

PS: my code so far

use POSIX qw(ceil floor);
use strict;

my $binSize  = $ARGV[0]; # size of each bin (default 1000)
my $readSize = $ARGV[1];
my @chromosomes = ("chr1","chr2","chr3");

while(<STDIN>) {} # parse through file once to get number of lines
my $maxCount = $.; # max count => all reads from the same locus => number of lines in SAM file
my @counts = (0..$maxCount);
my %countHash;
map { $countHash{$_} = "" } @counts; # ???
my %chrHash;
map { $chrHash{$_} = { %countHash } } @chromosomes;

# old version:
my @bins = (0..(250000000/$binSize)); # chr1 ~ 250 mio bases
my %binHash;
my %chrHash;
map { $binHash{$_} = 0 } @bins;
# assume that all chromosomes are equal size and create a hash of hashes for counting
map { $chrHash{$_} = { %binHash } } @chromosomes;

# Read in SAM file and count reads per bin
while(<STDIN>) {
    chomp();
    my @line = split(/\t/,$_);
    if ($line[2] ~~ @chromosomes) {
        my $location = floor(ceil((2*$line[3]+$readSize)/2)/1000);
        $chrHash{$line[2]}{$location}++;
    }
}


In reply to Automatic creation of a hash of hashes of arrays (?) by bontus
