Re^2: When the input file is huge !!!

Replies are listed 'Best First'.
Re^3: When the input file is huge !!! by BrowserUk (Patriarch) on Jan 06, 2009 at 04:37 UTC
This is untested (so may need minor tweaks), and the error messages and checking are prefunctory, but it should be close: Update: Corrected a few typos and tested on a small (500,000) sequence dataset (It took 3.5 minutes, so yours should be done in around an hour. It uses < 2 MB ram) #! perl -slw use strict; use Time::HiRes qw[ time ]; sub readFASTA { my( $fh ) = @_; local $/ = "\n"; my $header = <$fh>; $header = substr $header, 1 if ord $header == ord '>'; ## First li +ne chomp $header; local $/ = '>'; my $seq = <$fh>; chomp $seq; ## get rid of trailing '>'; $seq =~ tr[\n][]d; return $header, \$seq; } sub writeFASTA { local $\ = "\n"; my( $fh, $header, $refSeq, $lineLength ) = @_; $lineLength \|\|= 50; my $length = length( $$refSeq ); my $chunks = int( $length / $lineLength ); --$chunks unless $length % $lineLength; print $fh '>', $header; print $fh substr $$refSeq, $_$lineLength, $lineLength for 0 .. $c +hunks; return; } open SMALL, '<', 'theSmallFile' or die $!; open BIG, '<', 'theBigFile' or die $!; until( eof( SMALL ) ) { my( $bigHdr, $bigRef ) = readFASTA( \BIG ); my( $hdrLeft, $leftRef ) = readFASTA( \SMALL ); my( $leftId, $leftLen ) = $hdrLeft =~ m[([^_]+)_left\s+length=(\d ++)]; die "Bad small left header '$hdrLeft'" unless defined $leftId and defined $leftLen; my( $hdrRight, $rightRef ) = readFASTA( \SMALL ); my( $rightId, $rightLen ) = $hdrRight =~ m[([^_]+)_right\s+length= +(\d+)]; die "Bad small right header '$hdrRight'" unless defined $rightId and defined $rightLen; die "Out of sequence data at record $.: $bigHdr/$leftId/$rightId" unless $bigHdr eq $leftId and $bigHdr eq $rightId; my @bits = split ' ', $$bigRef; writeFASTA( \STDOUT, $hdrLeft, \join(' ', @bits[ 0 .. $leftLen -1 ] ), 1e6 ); writeFASTA( \STDOUT, $hdrRight, \join( ' ', @bits[ -$rightLen .. -1 ] ), 1e6 ); } [download] Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l]
Re^4: When the input file is huge !!! by sugar (Beadle) on Jan 06, 2009 at 09:32 UTC
That was awesome !!! took 55 minutes ...thank you very very veryy much :)	[reply]
Re^4: When the input file is huge !!! by sugar (Beadle) on Jan 06, 2009 at 05:21 UTC
Thanks for your reply. Am happy that the program produces output for such a huge file now. but, sorry to tell you that, after minor changes though i managed to get the output, the numbers fetched are wrong for some sequences, where do i have to check to make the corrections?	[reply]
Re^5: When the input file is huge !!! by BrowserUk (Patriarch) on Jan 06, 2009 at 06:03 UTC
Please try the updated code in the node above. The problem was I got over-enthusiastic with my big number to disable the wrapping. 1e10 in a integer context (the length field of substr) was translated to -1. Hence one digit was being chopped of the end of the sequences. At least that was the only problem I found here when I tested it. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply]