in reply to Re^2: When the input file is huge !!!
in thread When the input file is huge !!!

This is untested (so may need minor tweaks), and the error messages and checking are prefunctory, but it should be close:

Update: Corrected a few typos and tested on a small (500,000) sequence dataset (It took 3.5 minutes, so yours should be done in around an hour. It uses < 2 MB ram)

#! perl -slw use strict; use Time::HiRes qw[ time ]; sub readFASTA { my( $fh ) = @_; local $/ = "\n"; my $header = <$fh>; $header = substr $header, 1 if ord $header == ord '>'; ## First li +ne chomp $header; local $/ = '>'; my $seq = <$fh>; chomp $seq; ## get rid of trailing '>'; $seq =~ tr[\n][]d; return $header, \$seq; } sub writeFASTA { local $\ = "\n"; my( $fh, $header, $refSeq, $lineLength ) = @_; $lineLength ||= 50; my $length = length( $$refSeq ); my $chunks = int( $length / $lineLength ); --$chunks unless $length % $lineLength; print $fh '>', $header; print $fh substr $$refSeq, $_*$lineLength, $lineLength for 0 .. $c +hunks; return; } open SMALL, '<', 'theSmallFile' or die $!; open BIG, '<', 'theBigFile' or die $!; until( eof( SMALL ) ) { my( $bigHdr, $bigRef ) = readFASTA( \*BIG ); my( $hdrLeft, $leftRef ) = readFASTA( \*SMALL ); my( $leftId, $leftLen ) = $hdrLeft =~ m[([^_]+)_left\s+length=(\d ++)]; die "Bad small left header '$hdrLeft'" unless defined $leftId and defined $leftLen; my( $hdrRight, $rightRef ) = readFASTA( \*SMALL ); my( $rightId, $rightLen ) = $hdrRight =~ m[([^_]+)_right\s+length= +(\d+)]; die "Bad small right header '$hdrRight'" unless defined $rightId and defined $rightLen; die "Out of sequence data at record $.: $bigHdr/$leftId/$rightId" unless $bigHdr eq $leftId and $bigHdr eq $rightId; my @bits = split ' ', $$bigRef; writeFASTA( \*STDOUT, $hdrLeft, \join(' ', @bits[ 0 .. $leftLen -1 ] ), 1e6 ); writeFASTA( \*STDOUT, $hdrRight, \join( ' ', @bits[ -$rightLen .. -1 ] ), 1e6 ); }

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Replies are listed 'Best First'.
Re^4: When the input file is huge !!!
by sugar (Beadle) on Jan 06, 2009 at 09:32 UTC
    That was awesome !!! took 55 minutes ...thank you very very veryy much :)
Re^4: When the input file is huge !!!
by sugar (Beadle) on Jan 06, 2009 at 05:21 UTC
    Thanks for your reply. Am happy that the program produces output for such a huge file now. but, sorry to tell you that, after minor changes though i managed to get the output, the numbers fetched are wrong for some sequences, where do i have to check to make the corrections?

      Please try the updated code in the node above.

      The problem was I got over-enthusiastic with my big number to disable the wrapping. 1e10 in a integer context (the length field of substr) was translated to -1. Hence one digit was being chopped of the end of the sequences.

      At least that was the only problem I found here when I tested it.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.