
I'm using Perl to iterate through a 5.9 GB FASTA file and store its unique IDs as hash keys. However, I run out of memory when I then start passing through a 14.7 GB XML file. I need help avoiding this. I can't use C++ for this assignment, and I thought a hash would be the best choice for saving both memory and time. Is this not the case? Do I have to resort to a slow array?

#!/usr/bin/perl5.8.8

use strict;
use warnings;

my $afa = "uniref100.fasta";  #smallAFA.txt
open (my $AFA,"<", $afa) || die $!; 

my ($ref_coord) = UniRef100_iterator();
IDseq_XMLextractor($ref_coord);

close ($AFA);

sub UniRef100_iterator
{
my %coord;
my $id; 
my $Startlocation;
my $Endlocation;

    while (<$AFA>)
    {
        if ($_ =~/^>UniRef100_([\w\d]+)/)
        {
        $id = $1;
        $Startlocation = tell $AFA;
        }
            else
            {
            $Endlocation = tell $AFA;
            $coord{$id} = "$Startlocation $Endlocation";
            }
    }
return (\%coord);
} #closes sub UniRef100_iterator


sub IDseq_XMLextractor
{
my ($r_coord)= @_; #capture the hash reference
my %coord = %{$r_coord}; #dereference (note: this copies the entire hash into a second one)

my $seqCount = 0; #counts the number of sequences
my $famCount = 0; #counts the number of families
my $fileCount = 1; #counts the number of files

my $xml = "uniref90.xml";  #smallXML.txt
open (my $XML,"<", $xml) || die $!;  

my $outfile = $fileCount."_ProFam";
open (my $OUTFILE,">", $outfile) || die $!;
print $OUTFILE "File$fileCount\n"; #print file header for first file 

open (my $ERROR,">","error.txt") || die $!;

    while (<$XML>) #is entering the while loop
    {
        if ($_ =~/^<entry id="(UniRef90_[\w\d]+)"/ && $seqCount > 2000)
        {
        print $OUTFILE "No. protein families = $famCount; No. of sequences = $seqCount\n\n";
        close ($OUTFILE);
        $seqCount = 0;
        $famCount = 0; 
        $fileCount++;
        $outfile = $fileCount."_ProFam";
        open ($OUTFILE,">", $outfile) || die $!;
        print $OUTFILE "File$fileCount\n";
        }

        if ($_ =~/^<entry id="(UniRef90_[\w\d]+)"/)
        {
        my $id = $1; #UniRef90id
        print $OUTFILE "\n>$id\n";
        $famCount++;
        }

        if ($_ =~/^<property type="UniRef100 ID" value="UniRef100_([\w\d]+)"/)
        {
        my $id = $1; #UniRef100id
       
            if (exists  $coord{$id}) 
            {
            my $SEQ;
            my @coord = split(/ /,$coord{$id});
            my $length = $coord[1] - $coord[0]; 
            delete $coord{$id};
            print $OUTFILE ">$id\n";
            seek ($AFA,$coord[0],0);
            read ($AFA,$SEQ,$length);
            print $OUTFILE "$SEQ";
            $seqCount++;
            }
                else {print $ERROR "Key $id does not exist in the hash\n"}
        }

}
close ($XML);    
} #closes sub IDseq_XMLextractor
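
One thing that may be worth checking in the script above: the extractor sub dereferences the index into a new hash, which duplicates every key and value in memory. The following is only a minimal sketch of working through the reference instead, so that a single copy of the index exists. The in-memory sample data and the helper names build_index and fetch_seq are hypothetical and used purely for illustration; whether this alone is enough for the real 5.9 GB file is not something the sketch can show.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for uniref100.fasta: a tiny in-memory sample.
my $fasta = ">UniRef100_A1\nMKV\nLLA\n>UniRef100_B2\nGGH\n";
open(my $fh, '<', \$fasta) or die $!;

# Build the index: id => "start end" byte offsets of the sequence body.
sub build_index {
    my ($in) = @_;
    my (%coord, $id, $start);
    while (my $line = <$in>) {
        if ($line =~ /^>UniRef100_(\w+)/) {
            $id    = $1;
            $start = tell($in);      # first byte after the header line
        }
        elsif (defined $id) {
            $coord{$id} = $start . ' ' . tell($in);
        }
    }
    return \%coord;                  # hand back a reference, not a copy
}

# Look up one id through the reference; the hash is never duplicated.
sub fetch_seq {
    my ($in, $index, $id) = @_;
    return unless exists $index->{$id};
    my ($start, $end) = split / /, delete $index->{$id};
    seek($in, $start, 0) or die $!;
    read($in, my $seq, $end - $start);
    return $seq;
}

my $index = build_index($fh);
print fetch_seq($fh, $index, 'B2');
```

In the original script, the equivalent change would be to drop the `%coord` copy inside the extractor and use `$r_coord->{$id}` wherever `$coord{$id}` appears.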

In reply to Hashes and Memory by Jeri
