I am working on building a hash of the frequencies of all "words" of length "n" in a genome. The genome contains 3 billion bases (letters in the set ACTG, where A pairs with T and G with C). Using simple perl hashes, I can do this for word sizes up to 12 with 4 GB of RAM. It seems like it should be possible to do better than this, since each letter carries only 2 bits of information. Can someone enlighten me about how perl does its hashing, and what might be a better solution (a more memory-efficient hash, without sacrificing too much speed)?
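For scale, here is a rough sketch of how one might measure the per-entry cost of a plain hash, using the CPAN module Devel::Size (assuming it is installed); the word length of 12 and the one-million-entry test size are just illustrative. On a 64-bit perl the result is typically well over 100 bytes per entry, even though the key itself is only 12 characters.

use strict;
use warnings;
use Devel::Size qw(total_size);   # CPAN module, assumed installed

my @bases = qw(A C G T);
my %count;

# Build a hash of one million random 12-letter words and report the
# average memory cost per entry (key + value + hash overhead).
for (1 .. 1_000_000) {
    my $word = join '', map { $bases[ int rand 4 ] } 1 .. 12;
    $count{$word}++;
}
printf "%d entries, %.1f bytes/entry\n",
    scalar(keys %count), total_size(\%count) / scalar(keys %count);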
An example of what the data look like for word size 11 is given here:
Word        Count
=========== =====
CAATGACTGAT 1052
AATGACTGATG 1426
ATGACTGATGT 1170
TGACTGATGTC 1105
GACTGATGTCC 781
ACTGATGTCCT 1148
CTGATGTCCTT 1468
TGATGTCCTTC 916
...
Code is here:
sub index_file {
    my %params  = @_;
    my $hashref = exists($params{hashref}) ? $params{hashref} : {};
    my $file    = $params{file};
    my $window  = $params{window};

    open(INF, $file) or die "Cannot open file $file : $!";
    print "Reading file....\n";

    # Concatenate the sequence lines, skipping FASTA header lines (those starting with '>').
    my $sequence;
    while (my $line = <INF>) {
        chomp($line);
        $sequence .= $line unless ($line =~ /^>/);
    }
    close(INF);

    $sequence =~ tr/a-z/A-Z/;   # uppercase everything
    $sequence =~ s/N//g;        # drop ambiguous bases

    print "Calculating....\n";
    # Slide a window of length $window along the sequence and count each word.
    # Note: <= (rather than <) so the final window is counted as well.
    for (my $offset = 0; $offset <= length($sequence) - $window; $offset++) {
        print "$offset\n" if ($offset % 10000000 == 0);
        $hashref->{ substr($sequence, $offset, $window) }++;
    }
    return ($hashref);
}
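To make the 2-bits-per-base idea concrete, here is a rough sketch of the direction I have in mind (the sub names count_kmers/word_count and the handling of N are just illustrative): pack each word into an integer, two bits per base, and use that integer as an index into a packed string of 32-bit counters via vec(). For word size 12 that is 4**12 counters, i.e. about 64 MB, instead of a hash holding millions of 12-character string keys.

use strict;
use warnings;

my %code = (A => 0, C => 1, G => 2, T => 3);

sub count_kmers {
    my ($sequence, $k) = @_;
    my $counts = "\0" x (4 ** $k * 4);   # one 32-bit counter per possible word
    my $mask   = (1 << (2 * $k)) - 1;    # keep only the low 2k bits
    my $word   = 0;
    my $filled = 0;                      # consecutive valid bases accumulated so far
    for my $i (0 .. length($sequence) - 1) {
        my $base = substr($sequence, $i, 1);
        if (exists $code{$base}) {
            # Shift in the next base (2 bits) and drop the oldest one.
            $word = (($word << 2) | $code{$base}) & $mask;
            $filled++;
        }
        else {
            # An N or other ambiguity code breaks the window.
            $filled = 0;
        }
        vec($counts, $word, 32)++ if $filled >= $k;
    }
    return \$counts;                     # return a reference to avoid copying 64 MB
}

sub word_count {
    # Look up the count for one word (assumed to contain only A/C/G/T).
    my ($counts_ref, $word) = @_;
    my $index = 0;
    $index = ($index << 2) | $code{$_} for split //, $word;
    return vec($$counts_ref, $index, 32);
}

The trade-off is that memory now scales with 4**k rather than with the number of distinct words actually seen, so this approach wins for word sizes up to roughly 14 or 15 and a plain hash (or a disk-based approach) wins beyond that.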
Thanks,
Sean