Re: Automatic creation of a hash of hashes of arrays (?)

Welcome to the monastery.

"What I have in mind is to create a hash that contains the chromosome names as keys (%chrHash as above), but for each chromosome use a hash of discrete counts that in turn contains the locations. So if I find that read A comes from chr1:100, then $chrHash{"chr1"}{1} should be assigned 100. Since read B and C are also coming from the same location, $chrHash{"chr1"}{2} and $chrHash{"chr1"}{3} should subsequently be assigned location 100. Location 100 is thus in a way moving up the ladder, so I also need to erase it from the previous counts."

I wouldn't do it that way: all the additional coding and processing to implement "erase it from the previous counts" is unnecessary. Instead, I would use a structure where the values of each $chrHash{$chr}{$location} are the counts.

"The problem for me is: how can I assign an empty array to each of the discrete counts, which can then be used to store all locations once they have been identified?"

Again, this would be unnecessary and could lead to a hash that uses large amounts of memory and takes a long time to process.

"My task is to count how many reads are coming from a certain location."

"I am dealing with a huge number of locations and I am only interested in non-zero counts"

Those two statements are ambiguous; possibly contradictory. In the code below, get_loc_reads() will return a count from a specific location; get_read_counts() will output all the locations with a positive count. You may need to pick and choose code from those two to get exactly what you want: hopefully, the techniques to access the data are obvious.

In this script I've had to guess at your input data format. I've also provided an option (currently commented out) to store the sequence identifers — more on this after the code.

#!/usr/bin/env perl -l

use strict;
use warnings;

my %count;

while (<DATA>) {
    my ($seq, $chr, $loc) = split;
    #push @{$count{$chr}{$loc}}, $seq;
    ++$count{$chr}{$loc};
}

print 'Reads by specific chromosome and location:';
my @wanted = ([qw{chr1 100}], [qw{chr1 500}], [qw{chr2 200}], [qw{chr3
+ 100}]);

for (@wanted) {
    print "Reads on CHR:$_->[0] LOC:$_->[1] = ", get_loc_reads(@$_);
}

print 'Reads with positive counts by chromosome and location:';
get_read_counts();

sub get_loc_reads {
    my ($chr, $loc) = @_;
    #return exists $count{$chr}{$loc} ? scalar @{$count{$chr}{$loc}} :
+ 0;
    return exists $count{$chr}{$loc} ? $count{$chr}{$loc} : 0;
}

sub get_read_counts {
    for my $chr (sort keys %count) {
        for my $loc (sort keys %{$count{$chr}}) {
            #print "Reads on CHR:$chr LOC:$loc = ", scalar @{$count{$c
+hr}{$loc}};
            print "Reads on CHR:$chr LOC:$loc = ", $count{$chr}{$loc};
        } 
    }
}

__DATA__
A chr1 100
B chr1 100
C chr1 100
D chr1 200
E chr1 300
F chr1 300
G chr1 300
H chr2 100
I chr2 200
J chr2 200
[download]

Output:

Reads by specific chromosome and location:
Reads on CHR:chr1 LOC:100 = 3
Reads on CHR:chr1 LOC:500 = 0
Reads on CHR:chr2 LOC:200 = 2
Reads on CHR:chr3 LOC:100 = 0
Reads with positive counts by chromosome and location:
Reads on CHR:chr1 LOC:100 = 3
Reads on CHR:chr1 LOC:200 = 1
Reads on CHR:chr1 LOC:300 = 3
Reads on CHR:chr2 LOC:100 = 1
Reads on CHR:chr2 LOC:200 = 2
[download]

As I alluded to above, there's also an option to store the sequence identifiers. I didn't know if you wanted this but it was easy to code and shows another method of getting a count that you may find useful now or in the future.

Three lines are commented out: uncomment these and comment out the line that follows each of them. Here's the changes:

    push @{$count{$chr}{$loc}}, $seq;
    #++$count{$chr}{$loc};
...
    return exists $count{$chr}{$loc} ? scalar @{$count{$chr}{$loc}} : 
+0;
    #return exists $count{$chr}{$loc} ? $count{$chr}{$loc} : 0;
...
            print "Reads on CHR:$chr LOC:$loc = ", scalar @{$count{$ch
+r}{$loc}};
            #print "Reads on CHR:$chr LOC:$loc = ", $count{$chr}{$loc}
+;
[download]

The output is identical but you now have access to the sequences, via @{$count{$chr}{$loc}}, should you need them.

-- Ken

Comment on Re: Automatic creation of a hash of hashes of arrays (?) Select or Download Code

Replies are listed 'Best First'.
Re^2: Automatic creation of a hash of hashes of arrays (?) by bontus (Novice) on Jan 30, 2014 at 12:55 UTC
Thank you both for your suggestions. I followed Ken's advice and solved my task, so thank you again and have a nice day :)	[reply]