Hashes and Memory

Jeri has asked for the wisdom of the Perl Monks concerning the following question:

I'm using Perl to iterate through a 5.9 GB FASTA file and store unique IDs as hash keys. However, I run out of memory when I start passing through a 14.7 GB XML file. I need help not running out of memory. I can't use C++ for this assignment, and I thought a hash would be the best memory/time saver. Is this not the case? Do I have to resort to a slow array?

#!/usr/bin/perl5.8.8

use strict;
use warnings;

my $afa = "uniref100.fasta";  #smallAFA.txt
open (my $AFA,"<", $afa) || die $!; 

my ($ref_coord) = UniRef100_iterator();
IDseq_XMLextractor($ref_coord);

close ($AFA);

sub UniRef100_iterator
{
my %coord;
my $id; 
my $Startlocation;
my $Endlocation;

    while (<$AFA>)
    {
        if ($_ =~/^>UniRef100_([\w\d]+)/)
        {
        $id = $1;
        $Startlocation = tell $AFA;
        }
            else
            {
            $Endlocation = tell $AFA;
            $coord{$id} = "$Startlocation $Endlocation";
            }
    }
return (\%coord);
} #closes sub Indexor


sub IDseq_XMLextractor
{
my ($r_coord)= @_; #capturing %hash
my %coord = %{$ref_coord}; #dereference

my $seqCount = 0; #counts the number of sequences
my $famCount = 0; #counts the number of families
my $fileCount = 1; #counts the number of files

my $xml = "uniref90.xml";  #smallXML.txt
open (my $XML,"<", $xml) || die $!;  

my $outfile = $fileCount."_ProFam";
open (my $OUTFILE,">", $outfile) || die $!;
print $OUTFILE "File$fileCount\n"; #print file header for first file 

open (my $ERROR,">","error.txt") || die $!;

    while (<$XML>) #is entering the while loop
    {
        if ($_ =~/^<entry id="(UniRef90_[\w\d]+)"/ && $seqCount > 2000)
        {
        print $OUTFILE "No. protein families = $famCount; No. of sequences = $seqCount\n\n";
        close ($OUTFILE);
        $seqCount = 0;
        $famCount = 0; 
        $fileCount++;
        $outfile = $fileCount."_ProFam";
        open ($OUTFILE,">", $outfile) || die $!;
        print $OUTFILE "File$fileCount\n";
        }

        if ($_ =~/^<entry id="(UniRef90_[\w\d]+)"/)
        {
        my $id = $1; #UniRef90id
        print $OUTFILE "\n>$id\n";
        $famCount++;
        }

        if ($_ =~/^<property type="UniRef100 ID" value="UniRef100_([\w\d]+)"/)
        {
        my $id = $1; #UniRef100id
       
            if (exists  $coord{$id}) 
            {
            my $SEQ;
            my @coord = split(/ /,$coord{$id});
            my $length = $coord[1] - $coord[0]; 
            delete $coord{$id};
            print $OUTFILE ">$id\n";
            seek ($AFA,$coord[0],0);
            read ($AFA,$SEQ,$length);
            print $OUTFILE "$SEQ";
            $seqCount++;
            }
                else {print $ERROR "Key $id does not exist in the hash\n"}
        }

}
close ($XML);    
} #closes sub ID_XMLextractor

Replies are listed 'Best First'.
Re: Hashes and Memory by zentara (Cardinal) on Sep 08, 2011 at 16:47 UTC
If it's an XML file, why not use an XML module? See Parsing huge XML file.
Re^2: Hashes and Memory by Jeri (Scribe) on Sep 08, 2011 at 17:18 UTC
I'm going to give Twig a shot. Thanks!
Re: Hashes and Memory by RichardK (Parson) on Sep 08, 2011 at 17:11 UTC
Hmm, this looks odd: `my %coord = %{$ref_coord}; #dereference` Doesn't that copy the hash? It's probably better to use the reference directly. Something like: `#e.g. my @c = split /\s+/,$ref_coord->{$id};`
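To illustrate the point about copying, here is a minimal sketch. The two-entry hash and its IDs are made-up sample data, not values from the real index. It suggests that `%{$ref_coord}` assigned to a new hash produces separate storage, while `->` works on the original hash.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the real ID => "start end" index.
my $ref_coord = { A0A000 => "10 52", B1B111 => "60 97" };

# Dereferencing the whole hash into a new variable makes a second copy.
my %copy = %{$ref_coord};
$copy{A0A000} = "changed";

# The original is untouched, which shows that %copy is separate storage.
print "original: $ref_coord->{A0A000}\n";
print "copy:     $copy{A0A000}\n";

# Going through the reference reads and writes the one original hash.
$ref_coord->{A0A000} = "0 5";
print "via ref:  $ref_coord->{A0A000}\n";
```

With an index built from a multi-gigabyte file, that second copy would roughly double the memory the hash needs, which is why using the reference directly is likely the cheaper choice.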
Re^2: Hashes and Memory by Jeri (Scribe) on Sep 08, 2011 at 17:24 UTC
Does it copy the hash? I'm still rather new at Perl (just 1 year of experience). This could most likely be a novice error. How does your code work? `my @c = split /\s+/,$ref_coord->{$id};` Does it create an array based on the space between the coordinates in the hash value? And what does the "->" mean exactly?
Re^3: Hashes and Memory by Jeri (Scribe) on Sep 08, 2011 at 17:37 UTC
Actually, I understand it. My question is: how can I put the coordinates in @c if the hash has not been dereferenced? Or am I doing (and thinking about) this all wrong?
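One possible answer to that question, as a sketch: a single element can be read, tested, and deleted through the reference, so the hash as a whole never needs dereferencing. The helper name `span_for` and the sample offsets are hypothetical and not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the hash reference and an ID, and returns
# (start, length), or an empty list when the ID is not in the index.
sub span_for {
    my ($r_coord, $id) = @_;
    return unless exists $r_coord->{$id};    # element lookup through the reference
    my ($start, $end) = split /\s+/, $r_coord->{$id};
    delete $r_coord->{$id};                  # removes the entry from the caller's hash
    return ($start, $end - $start);
}

my $index = { A0A000 => "10 52" };    # sample data, not real file offsets
my ($start, $length) = span_for($index, "A0A000");
print "start=$start length=$length\n";
print "entries left: ", scalar(keys %$index), "\n";
```

Here `$r_coord->{$id}` means "the value stored under `$id` in the hash that `$r_coord` points to", so `split` sees only that one string and nothing is copied.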
Re^4: Hashes and Memory by Kc12349 (Monk) on Sep 08, 2011 at 18:13 UTC
Re^4: Hashes and Memory by RichardK (Parson) on Sep 09, 2011 at 09:37 UTC
Re^4: Hashes and Memory by Jeri (Scribe) on Sep 09, 2011 at 15:46 UTC