I am working on building a hash of the frequencies of all "words" of length "n" in a genome. The genome contains 3 billion bases (letters in the set ACTG, where A pairs with T and G with C). Using simple perl hashes, I can do this for word sizes up to 12 with 4 GB of RAM. It seems like it should be possible to do better than this, since each letter carries only 2 bits of information. Can someone enlighten me about how perl does its hashing, and what might be a better solution (a more memory-efficient hash, without sacrificing too much speed)?
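For scale, here is a rough sketch of how one might measure the per-entry cost of a plain hash, using the CPAN module Devel::Size (assuming it is installed); the word length of 12 and the one-million-entry test size are just illustrative. On a 64-bit perl the result is typically well over 100 bytes per entry, even though the key itself is only 12 characters.

use strict;
use warnings;
use Devel::Size qw(total_size);   # CPAN module, assumed installed

my @bases = qw(A C G T);
my %count;

# Build a hash of one million random 12-letter words and report the
# average memory cost per entry (key + value + hash overhead).
for (1 .. 1_000_000) {
    my $word = join '', map { $bases[ int rand 4 ] } 1 .. 12;
    $count{$word}++;
}
printf "%d entries, %.1f bytes/entry\n",
    scalar(keys %count), total_size(\%count) / scalar(keys %count);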
An example of what the data look like for word size 11 is given here:
Word        Count
=========== =====
CAATGACTGAT 1052
AATGACTGATG 1426
ATGACTGATGT 1170
TGACTGATGTC 1105
GACTGATGTCC 781
ACTGATGTCCT 1148
CTGATGTCCTT 1468
TGATGTCCTTC 916
...
Code is here:
sub index_file {
    my %params  = @_;
    my $hashref = exists($params{hashref}) ? $params{hashref} : {};
    my $file    = $params{file};
    my $window  = $params{window};

    open(INF, $file) or die "Cannot open file $file : $!";
    print "Reading file....\n";

    # Concatenate the sequence lines, skipping FASTA header lines (those starting with '>').
    my $sequence;
    while (my $line = <INF>) {
        chomp($line);
        $sequence .= $line unless ($line =~ /^>/);
    }
    close(INF);

    $sequence =~ tr/a-z/A-Z/;   # uppercase everything
    $sequence =~ s/N//g;        # drop ambiguous bases

    print "Calculating....\n";
    # Slide a window of length $window along the sequence and count each word.
    # Note: <= (rather than <) so the final window is counted as well.
    for (my $offset = 0; $offset <= length($sequence) - $window; $offset++) {
        print "$offset\n" if ($offset % 10000000 == 0);
        $hashref->{ substr($sequence, $offset, $window) }++;
    }
    return ($hashref);
}
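To make the 2-bits-per-base idea concrete, here is a rough sketch of the direction I have in mind (the sub names count_kmers/word_count and the handling of N are just illustrative): pack each word into an integer, two bits per base, and use that integer as an index into a packed string of 32-bit counters via vec(). For word size 12 that is 4**12 counters, i.e. about 64 MB, instead of a hash holding millions of 12-character string keys.

use strict;
use warnings;

my %code = (A => 0, C => 1, G => 2, T => 3);

sub count_kmers {
    my ($sequence, $k) = @_;
    my $counts = "\0" x (4 ** $k * 4);   # one 32-bit counter per possible word
    my $mask   = (1 << (2 * $k)) - 1;    # keep only the low 2k bits
    my $word   = 0;
    my $filled = 0;                      # consecutive valid bases accumulated so far
    for my $i (0 .. length($sequence) - 1) {
        my $base = substr($sequence, $i, 1);
        if (exists $code{$base}) {
            # Shift in the next base (2 bits) and drop the oldest one.
            $word = (($word << 2) | $code{$base}) & $mask;
            $filled++;
        }
        else {
            # An N or other ambiguity code breaks the window.
            $filled = 0;
        }
        vec($counts, $word, 32)++ if $filled >= $k;
    }
    return \$counts;                     # return a reference to avoid copying 64 MB
}

sub word_count {
    # Look up the count for one word (assumed to contain only A/C/G/T).
    my ($counts_ref, $word) = @_;
    my $index = 0;
    $index = ($index << 2) | $code{$_} for split //, $word;
    return vec($$counts_ref, $index, 32);
}

The trade-off is that memory now scales with 4**k rather than with the number of distinct words actually seen, so this approach wins for word sizes up to roughly 14 or 15 and a plain hash (or a disk-based approach) wins beyond that.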
Thanks,
Sean