sugar has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks, with reference to the query "sorting and other problems in hash" that I posted last week, I have several doubts. Following the corrections you all gave, and after thoroughly understanding each and every line of the code, I tried to solve the problem. Of course, I only got results for the sample file and not for the original file. Since the original file is 8GB in size, I doubt this is the right approach. Just loading the file and counting its lines in a while loop takes 1 minute, so why does the following program run forever and never print any output at all? Please help me solve this problem. I need to process the 8 GB of data by any means :(
#!/usr/bin/perl
use strict;
use warnings;

if ( scalar @ARGV < 2 ) {
    die("Usage:\n\t$0 file1 file2\n\n");
}

my %data;
my $fn = $ARGV[0];
{
    my $sequence;
    open DF, $fn or die $!;
    while ( my $line = <DF> ) {
        chomp $line;
        next if $line =~ m/^\s*$/;
        if ( substr( $line, 0, 1 ) eq '>' ) {
            ( $sequence, undef ) = split /\s+/, $line;
        }
        else {
            @{ $data{$sequence} } = split /\s+/, $line;
        }
    }
    close DF;
}

$fn = $ARGV[1];
{
    my $sequence;
    my $side;
    my $len;
    open DF, $fn or die $!;
    while ( my $line = <DF> ) {
        chomp $line;
        next if $line =~ m/^\s*$/;
        if ( substr( $line, 0, 1 ) eq '>' ) {
            ( $sequence, $side, undef, $len ) = split /(?:_|=|\s)+/, $line;
            if ( exists $data{$sequence} ) {
                my @range = ( 0, $len - 1 );
                if ( $side ne "left" ) {
                    @range = ( -1 * $len, -1 );
                }
                print $line, "\n";
                print join( ' ',
                    @{ $data{$sequence} }[ $range[0] .. $range[1] ] ), "\n";
            }
            else {
                warn "No data read from " . $ARGV[0] . " for sequence $sequence\n";
            }
        }
        else {
            print join( " ", split //, $line ), "\n";
        }
    }
    close DF;
}

Re: When the input file is huge !!!
by tilly (Archbishop) on Jan 06, 2009 at 03:53 UTC
    You have 2 files. If the first is large (a few hundred MB), then I'd expect Perl to run out of memory before you've finished loading it. If you're using virtual memory, this might take a long time. If it did succeed in loading but is swapping, then the second file will take forever to process. Either way, you aren't going to handle 8 GB of data in RAM at once.

    I would use a sort utility on both files. One is available as a standard Unix utility; I have not used it, but Sort::External is a pure Perl alternative if you don't have that utility. Then process both files in parallel, the idea being that sequences come up in the same order in both files. So you have 2 filehandles (one for each file) and 2 last-read records (one for each file), and you always read from whichever one is smaller, processing a match when you find one. That way you never keep any bulk data in RAM.
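    A rough sketch of that parallel walk, assuming both files have already been sorted by sequence id, that each record is one '>' header line followed by a single data line (as in the OP's code), and that the id is whatever precedes the first '_' or space in the header; the file names and that regex are illustrative assumptions only:

    #!/usr/bin/perl
    # Walk two sorted files in parallel, always advancing the one whose
    # current id sorts lower, and doing the real work when the ids match.
    use strict;
    use warnings;

    sub read_record {                        # returns ( id, header, data ) or ()
        my ($fh) = @_;
        my $header = <$fh>;
        return unless defined $header;
        chomp $header;
        my ($id) = $header =~ /^>([^\s_]+)/; # id = text before first '_' or space
        my $data = <$fh>;
        chomp $data if defined $data;
        return ( $id, $header, $data );
    }

    open my $fh1, '<', 'file1.sorted' or die $!;
    open my $fh2, '<', 'file2.sorted' or die $!;

    my @rec1 = read_record($fh1);
    my @rec2 = read_record($fh2);

    while ( @rec1 and @rec2 ) {
        my $cmp = $rec1[0] cmp $rec2[0];
        if    ( $cmp < 0 ) { @rec1 = read_record($fh1) }   # file1 is behind
        elsif ( $cmp > 0 ) { @rec2 = read_record($fh2) }   # file2 is behind
        else {
            # ids match: process $rec1[2] against $rec2[1]/$rec2[2] here,
            # then advance only file2 (it has a left and a right record per id)
            print "$rec2[1]\n";
            @rec2 = read_record($fh2);
        }
    }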

    Be warned that sorting 8 GB is liable to take 20 minutes or so on your machine.

      This sounds like a pretty good idea! A simple sort-merge algorithm takes each "hunk" of data that it can handle in memory, sorts it, and writes it to a new place on disk. This requires one read of the entire data set and one write of it. 8 GB in the scheme of things is not that "big".
      Let's say that each "hunk" is just 500 MB, which my Windows machine can sort easily; we wind up with 16 hunks. The merge then opens, say, all 16 files at once and the rest is easy: just move the smallest top record of the 16 to the output. So for a "small" data set like 8 GB it is: 1) read once, 2) write once, 3) read again, 4) write again.
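      Just to make that merge step concrete, here is a minimal Perl sketch of it, assuming the sorted hunks have already been written out as files named hunk.*.sorted with one record per line (both the file names and the one-record-per-line layout are assumptions for illustration). A linear scan for the minimum is fine for 16 hunks, though a heap would scale better:

      #!/usr/bin/perl
      # Open every sorted hunk at once and repeatedly move the smallest
      # current record to the output -- the merge step described above.
      use strict;
      use warnings;

      my @files = glob 'hunk.*.sorted';        # assumed names of the sorted hunks
      my @fhs   = map { open my $fh, '<', $_ or die "$_: $!"; $fh } @files;
      my @tops  = map { scalar <$_> } @fhs;    # current top record of each hunk

      while ( grep { defined } @tops ) {
          my $min;                             # index of the smallest top record
          for my $i ( 0 .. $#tops ) {
              next unless defined $tops[$i];
              $min = $i if !defined $min or $tops[$i] lt $tops[$min];
          }
          print $tops[$min];                   # move it to the output ...
          $tops[$min] = readline $fhs[$min];   # ... and refill from that hunk
      }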
      I suspect the system utilities are faster than this. Very smart ones will shovel data between various disks to speed up access, and their algorithms are smarter than the one described above. In any case, "sort" on a big machine is heavily optimized; the Unix command line will probably do much better than you think.
        FYI, a simple merge-sort will take log(n)/log(2) passes through the data. You are describing a somewhat more intelligent multi-way merge that will take fewer passes. Plus, you are eliminating most of the passes by using a different sort for the initial pass (sorting whole 500 MB hunks in memory at once).

        The merge is likely somewhat slower than you describe. The problem is how many parallel read/write streams the drive can handle while still streaming rather than seeking. My memory says that the limit is 4, which means that you can at most read 3 chunks and write 1. If you try to do more, your disk starts to thrash. If that is right, then it can handle 16 hunks in 3 merge passes through the dataset, where each pass has to both read and write all of the data. That is in addition to the original pass that sorts it into chunks, so the dataset has to be streamed 8 times (4 times from disk, 4 times to it). If streaming takes 1 minute per pass (which is what the OP said about his dataset), that would be 8 minutes. That is substantially faster than what I said, but I like to be conservative when I tell people that things will be slow, because I don't want them to give up prematurely if it is slower than they expect.
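        For concreteness, here is a tiny back-of-the-envelope script for that arithmetic; the chunk count, merge width, and the 1-minute-per-pass figure are taken from the discussion above, not measured:

        #!/usr/bin/perl
        # Rough pass-count estimate for an external merge-sort, using the
        # numbers from the discussion above (16 hunks, 3 reads + 1 write).
        use strict;
        use warnings;
        use POSIX qw( ceil );

        my $hunks         = 16;   # pre-sorted 500 MB hunks on disk
        my $merge_ways    = 3;    # 3 input streams + 1 output stream = 4 total
        my $mins_per_pass = 1;    # OP: one streaming pass takes about a minute

        my $merge_passes = ceil( log($hunks) / log($merge_ways) );  # 3
        my $total_passes = 1 + $merge_passes;                       # plus the initial sort pass
        my $streamings   = 2 * $total_passes;                       # each pass reads and writes

        printf "%d merge passes, %d streamings, ~%d minutes\n",
            $merge_passes, $streamings, $streamings * $mins_per_pass;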

        Also note that a pure Perl implementation will probably have trouble sorting 500 MB chunks of data in RAM because of the memory overhead of Perl's internal data structures. However, if you are reading from and writing to the filesystem, you can wind up creating, writing, reading, and deleting a file with all operations happening in RAM, without anything ever being flushed to disk. Used intelligently, this fact can greatly increase performance because most passes never touch the disk, and it can make a fairly naive merge-sort perform reasonably close to a much more optimized one.

Re: When the input file is huge !!!
by BrowserUk (Patriarch) on Jan 06, 2009 at 03:55 UTC
    The reason being that the original file is 8GB in size,

    Which file is 8GB? You're processing 2 files. Are they both 8GB?

    Other important pieces of information:

    1. How many sequences are there in each of those files?

      A simple grep '>' file | wc -l will tell you.

    2. The same number in both?
    3. Do the sequences appear in the same order in both files?

    With the right answers, your problem could be much simpler and quicker to solve than with your current approach. Even if the answers aren't so helpful, there is almost certainly a better approach than your current one, but you do need to provide the information.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Hi,
      1) There are 13572668 lines in file1 and 27145336 lines in file2. [With the grep command, the number of sequences in file1 is 6786334 and in file2 is 13572668.]
      2) Both files share the same set of unique ids; file2 has double the number of ids compared to file1 (one for left and one for right).
      3) file1 is 8GB in size and file2 is 2GB.
      4) The sequences appear in sorted order in both files; therefore the first id in file1 and the first 2 ids in file2 (one for left, one for right) refer to the same sequence.
      Hope I have provided all the information you asked for. Please guide!!! Thank you.

        This is untested (so it may need minor tweaks), and the error messages and checking are perfunctory, but it should be close:

        Update: Corrected a few typos and tested on a small (500,000 sequence) dataset. It took 3.5 minutes, so yours should be done in around an hour, and it uses < 2 MB of RAM.

        #! perl -slw
        use strict;
        use Time::HiRes qw[ time ];

        sub readFASTA {
            my( $fh ) = @_;
            local $/ = "\n";
            my $header = <$fh>;
            $header = substr $header, 1 if ord $header == ord '>'; ## First line
            chomp $header;
            local $/ = '>';
            my $seq = <$fh>;
            chomp $seq; ## get rid of trailing '>';
            $seq =~ tr[\n][]d;
            return $header, \$seq;
        }

        sub writeFASTA {
            local $\ = "\n";
            my( $fh, $header, $refSeq, $lineLength ) = @_;
            $lineLength ||= 50;
            my $length = length( $$refSeq );
            my $chunks = int( $length / $lineLength );
            --$chunks unless $length % $lineLength;
            print $fh '>', $header;
            print $fh substr $$refSeq, $_*$lineLength, $lineLength for 0 .. $chunks;
            return;
        }

        open SMALL, '<', 'theSmallFile' or die $!;
        open BIG,   '<', 'theBigFile'   or die $!;

        until( eof( SMALL ) ) {
            my( $bigHdr, $bigRef ) = readFASTA( \*BIG );

            my( $hdrLeft, $leftRef ) = readFASTA( \*SMALL );
            my( $leftId, $leftLen ) = $hdrLeft =~ m[([^_]+)_left\s+length=(\d+)];
            die "Bad small left header '$hdrLeft'"
                unless defined $leftId and defined $leftLen;

            my( $hdrRight, $rightRef ) = readFASTA( \*SMALL );
            my( $rightId, $rightLen ) = $hdrRight =~ m[([^_]+)_right\s+length=(\d+)];
            die "Bad small right header '$hdrRight'"
                unless defined $rightId and defined $rightLen;

            die "Out of sequence data at record $.: $bigHdr/$leftId/$rightId"
                unless $bigHdr eq $leftId and $bigHdr eq $rightId;

            my @bits = split ' ', $$bigRef;
            writeFASTA( \*STDOUT, $hdrLeft,  \join( ' ', @bits[ 0 .. $leftLen -1 ] ),  1e6 );
            writeFASTA( \*STDOUT, $hdrRight, \join( ' ', @bits[ -$rightLen .. -1 ] ), 1e6 );
        }
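        Note that the input file names ('theSmallFile' and 'theBigFile') are hard-coded near the top, so you would point SMALL at the 2 GB file with the left/right headers and BIG at the 8 GB data file, and redirect STDOUT to capture the results, e.g. something like perl thescript.pl > output.txt (the script name here is just a placeholder).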
