1)There are 13572668 lines in file 1 and 27145336
lines in file2. [with the grep function - the
number of sequences in file1 is 6786334 and
file2 is 13572668]
2)Both the file have a set of unique ids. So,
file2 is double the number of id's(for left
and right) compared to file1.
3)The size of the file1 is 8GB and file2 is 2GB.
4)The sequences appear in the sorted order for
both the files. therefore, the first id in file1
and the first 2 id's(one for left, one for right)
of file2 will be the same.
Hope i have provided with all the information you have asked for. Please guide !!! thank you | [reply] [d/l] |
This is untested (so may need minor tweaks), and the error messages and checking are prefunctory, but it should be close:
Update: Corrected a few typos and tested on a small (500,000) sequence dataset (It took 3.5 minutes, so yours should be done in around an hour. It uses < 2 MB ram)
#! perl -slw
use strict;
use Time::HiRes qw[ time ];
sub readFASTA {
my( $fh ) = @_;
local $/ = "\n";
my $header = <$fh>;
$header = substr $header, 1 if ord $header == ord '>'; ## First li
+ne
chomp $header;
local $/ = '>';
my $seq = <$fh>;
chomp $seq; ## get rid of trailing '>';
$seq =~ tr[\n][]d;
return $header, \$seq;
}
sub writeFASTA {
local $\ = "\n";
my( $fh, $header, $refSeq, $lineLength ) = @_;
$lineLength ||= 50;
my $length = length( $$refSeq );
my $chunks = int( $length / $lineLength );
--$chunks unless $length % $lineLength;
print $fh '>', $header;
print $fh substr $$refSeq, $_*$lineLength, $lineLength for 0 .. $c
+hunks;
return;
}
open SMALL, '<', 'theSmallFile' or die $!;
open BIG, '<', 'theBigFile' or die $!;
until( eof( SMALL ) ) {
my( $bigHdr, $bigRef ) = readFASTA( \*BIG );
my( $hdrLeft, $leftRef ) = readFASTA( \*SMALL );
my( $leftId, $leftLen ) = $hdrLeft =~ m[([^_]+)_left\s+length=(\d
++)];
die "Bad small left header '$hdrLeft'"
unless defined $leftId and defined $leftLen;
my( $hdrRight, $rightRef ) = readFASTA( \*SMALL );
my( $rightId, $rightLen ) = $hdrRight =~ m[([^_]+)_right\s+length=
+(\d+)];
die "Bad small right header '$hdrRight'"
unless defined $rightId and defined $rightLen;
die "Out of sequence data at record $.: $bigHdr/$leftId/$rightId"
unless $bigHdr eq $leftId and $bigHdr eq $rightId;
my @bits = split ' ', $$bigRef;
writeFASTA(
\*STDOUT, $hdrLeft,
\join(' ', @bits[ 0 .. $leftLen -1 ] ), 1e6
);
writeFASTA(
\*STDOUT, $hdrRight,
\join( ' ', @bits[ -$rightLen .. -1 ] ), 1e6
);
}
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
That was awesome !!!
took 55 minutes ...thank you very very veryy much :)
| [reply] |
Thanks for your reply. Am happy that the program produces output for such a huge file now. but, sorry to tell you that, after minor changes though i managed to get the output, the numbers fetched are wrong for some sequences, where do i have to check to make the corrections?
| [reply] |