in reply to Re: When the input file is huge !!!
in thread When the input file is huge !!!

hi,
1)There are 13572668 lines in file 1 and 27145336 lines in file2. [with the grep function - the number of sequences in file1 is 6786334 and file2 is 13572668] 2)Both the file have a set of unique ids. So, file2 is double the number of id's(for left and right) compared to file1. 3)The size of the file1 is 8GB and file2 is 2GB. 4)The sequences appear in the sorted order for both the files. therefore, the first id in file1 and the first 2 id's(one for left, one for right) of file2 will be the same.
Hope i have provided with all the information you have asked for. Please guide !!! thank you

Replies are listed 'Best First'.
Re^3: When the input file is huge !!!
by BrowserUk (Patriarch) on Jan 06, 2009 at 04:37 UTC

    This is untested (so may need minor tweaks), and the error messages and checking are prefunctory, but it should be close:

    Update: Corrected a few typos and tested on a small (500,000) sequence dataset (It took 3.5 minutes, so yours should be done in around an hour. It uses < 2 MB ram)

    #! perl -slw use strict; use Time::HiRes qw[ time ]; sub readFASTA { my( $fh ) = @_; local $/ = "\n"; my $header = <$fh>; $header = substr $header, 1 if ord $header == ord '>'; ## First li +ne chomp $header; local $/ = '>'; my $seq = <$fh>; chomp $seq; ## get rid of trailing '>'; $seq =~ tr[\n][]d; return $header, \$seq; } sub writeFASTA { local $\ = "\n"; my( $fh, $header, $refSeq, $lineLength ) = @_; $lineLength ||= 50; my $length = length( $$refSeq ); my $chunks = int( $length / $lineLength ); --$chunks unless $length % $lineLength; print $fh '>', $header; print $fh substr $$refSeq, $_*$lineLength, $lineLength for 0 .. $c +hunks; return; } open SMALL, '<', 'theSmallFile' or die $!; open BIG, '<', 'theBigFile' or die $!; until( eof( SMALL ) ) { my( $bigHdr, $bigRef ) = readFASTA( \*BIG ); my( $hdrLeft, $leftRef ) = readFASTA( \*SMALL ); my( $leftId, $leftLen ) = $hdrLeft =~ m[([^_]+)_left\s+length=(\d ++)]; die "Bad small left header '$hdrLeft'" unless defined $leftId and defined $leftLen; my( $hdrRight, $rightRef ) = readFASTA( \*SMALL ); my( $rightId, $rightLen ) = $hdrRight =~ m[([^_]+)_right\s+length= +(\d+)]; die "Bad small right header '$hdrRight'" unless defined $rightId and defined $rightLen; die "Out of sequence data at record $.: $bigHdr/$leftId/$rightId" unless $bigHdr eq $leftId and $bigHdr eq $rightId; my @bits = split ' ', $$bigRef; writeFASTA( \*STDOUT, $hdrLeft, \join(' ', @bits[ 0 .. $leftLen -1 ] ), 1e6 ); writeFASTA( \*STDOUT, $hdrRight, \join( ' ', @bits[ -$rightLen .. -1 ] ), 1e6 ); }

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      That was awesome !!! took 55 minutes ...thank you very very veryy much :)
      Thanks for your reply. Am happy that the program produces output for such a huge file now. but, sorry to tell you that, after minor changes though i managed to get the output, the numbers fetched are wrong for some sequences, where do i have to check to make the corrections?

        Please try the updated code in the node above.

        The problem was I got over-enthusiastic with my big number to disable the wrapping. 1e10 in a integer context (the length field of substr) was translated to -1. Hence one digit was being chopped of the end of the sequences.

        At least that was the only problem I found here when I tested it.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.