in reply to When the input file is huge !!!

The reason being that the original file is 8GB in size,

Which file is 8GB? You're processing 2 files. are they both 8GB?

Other important pieces of information:

  1. how many sequences are there in each of those files?

    A simple grep '>' file | wc -l will tell you.

  2. The same number in both?
  3. Do the sequences appear in the same order in both files?

With the right answers, your problem could be much simpler and quicker than your current approach. Even if the answers aren't so helpful, there is almost certainly a better approach than your current one, but you do need to provide the information,


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Replies are listed 'Best First'.
Re^2: When the input file is huge !!!
by sugar (Beadle) on Jan 06, 2009 at 04:18 UTC
    hi,
    1)There are 13572668 lines in file 1 and 27145336 lines in file2. [with the grep function - the number of sequences in file1 is 6786334 and file2 is 13572668] 2)Both the file have a set of unique ids. So, file2 is double the number of id's(for left and right) compared to file1. 3)The size of the file1 is 8GB and file2 is 2GB. 4)The sequences appear in the sorted order for both the files. therefore, the first id in file1 and the first 2 id's(one for left, one for right) of file2 will be the same.
    Hope i have provided with all the information you have asked for. Please guide !!! thank you

      This is untested (so may need minor tweaks), and the error messages and checking are prefunctory, but it should be close:

      Update: Corrected a few typos and tested on a small (500,000) sequence dataset (It took 3.5 minutes, so yours should be done in around an hour. It uses < 2 MB ram)

      #! perl -slw use strict; use Time::HiRes qw[ time ]; sub readFASTA { my( $fh ) = @_; local $/ = "\n"; my $header = <$fh>; $header = substr $header, 1 if ord $header == ord '>'; ## First li +ne chomp $header; local $/ = '>'; my $seq = <$fh>; chomp $seq; ## get rid of trailing '>'; $seq =~ tr[\n][]d; return $header, \$seq; } sub writeFASTA { local $\ = "\n"; my( $fh, $header, $refSeq, $lineLength ) = @_; $lineLength ||= 50; my $length = length( $$refSeq ); my $chunks = int( $length / $lineLength ); --$chunks unless $length % $lineLength; print $fh '>', $header; print $fh substr $$refSeq, $_*$lineLength, $lineLength for 0 .. $c +hunks; return; } open SMALL, '<', 'theSmallFile' or die $!; open BIG, '<', 'theBigFile' or die $!; until( eof( SMALL ) ) { my( $bigHdr, $bigRef ) = readFASTA( \*BIG ); my( $hdrLeft, $leftRef ) = readFASTA( \*SMALL ); my( $leftId, $leftLen ) = $hdrLeft =~ m[([^_]+)_left\s+length=(\d ++)]; die "Bad small left header '$hdrLeft'" unless defined $leftId and defined $leftLen; my( $hdrRight, $rightRef ) = readFASTA( \*SMALL ); my( $rightId, $rightLen ) = $hdrRight =~ m[([^_]+)_right\s+length= +(\d+)]; die "Bad small right header '$hdrRight'" unless defined $rightId and defined $rightLen; die "Out of sequence data at record $.: $bigHdr/$leftId/$rightId" unless $bigHdr eq $leftId and $bigHdr eq $rightId; my @bits = split ' ', $$bigRef; writeFASTA( \*STDOUT, $hdrLeft, \join(' ', @bits[ 0 .. $leftLen -1 ] ), 1e6 ); writeFASTA( \*STDOUT, $hdrRight, \join( ' ', @bits[ -$rightLen .. -1 ] ), 1e6 ); }

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        That was awesome !!! took 55 minutes ...thank you very very veryy much :)
        Thanks for your reply. Am happy that the program produces output for such a huge file now. but, sorry to tell you that, after minor changes though i managed to get the output, the numbers fetched are wrong for some sequences, where do i have to check to make the corrections?