PerlMonks

Speeding up stalled script

by onlyIDleft (Scribe)
on Feb 03, 2015 at 02:30 UTC ( [id://1115332] )

onlyIDleft has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I seek your help in identifying the best and fastest algorithm for my bioinformatics problem. Please allow me to explain (my explanation got deleted when I edited my original post).

For every run, I have 2 input files, each with up to a few million DNA sequences (sometimes much fewer). From input file 1, I will infer the length distribution of all sequences, i.e. how many times each length is observed. GOAL: to extract from input file 2 DNA sequences with the same length distribution as input file 1, i.e. the same number of sequences of each length. The Perl script I have written is quick ONLY for small input file pairs. It does not seem to scale up at all for large inputs. That is the help I seek from you. Thank you very much!
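To make the goal concrete, here is a small self-contained toy illustration (hypothetical data, not my real input) of the first step, counting how many times each sequence length occurs in input file 1:

use strict;
use warnings;

my %length_count;
while ( my $line = <DATA> ) {
    chomp $line;
    next if $line =~ /^>/;              # skip fasta headers
    $length_count{ length $line }++;    # length => times observed
}
printf "%d sequence(s) of length %d\n", $length_count{$_}, $_
    for sort { $a <=> $b } keys %length_count;

__DATA__
>seq1
ATGAAATGA
>seq2
ATGCCCTGA
>seq3
ATGAAACCCTAG

For this toy input the distribution is: length 9 seen twice, length 12 seen once. The task is then to pull the same number of sequences of each length out of input file 2.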

UPDATE1: Script edited to improve readability

UPDATE2: I checked my script, and I have to admit, I do not see changes / errors compared to the previous time that I ran it. So the question is: how can I modify it to make it much faster? I can only imagine that the large input file sizes need to be compensated for by a speedier script, so modifications to logic / syntax etc. will be MOST welcome. Thank you!

#!/usr/bin/perl
# Length_dstrbtn_seq_extractor.pl
# This Perl script accepts two input files - the first input is a multifasta file
# containing alternating lines of fasta headers and sequences, generated using the
# Perl script fix_multifasta0.pl.
# This Perl script calculates the frequency distribution of sequences of all sizes
# in this input file.
# The second input file is the user-chosen multifasta sequence file from which
# sequences with the same frequency distribution of lengths will be extracted.
# The only output is a multifasta sequence file. The fasta headers specify source
# sequence, start coordinate and stop coordinate.
# Syntax : perl Length_dstrbtn_seq_extractor.pl <input1> <input2-source4extraction>

#********************#
# PROCESS INPUT FILES
#********************#

use strict;
use warnings;

my $start_time = time;

my $input1 = shift @ARGV;
my $input2 = shift @ARGV;

my (@lengths, @source, @distrbtn, @output);

open(IN1, '<', $input1) or die "Can't read source file $input1 : $!\n";
while (<IN1>) {
    chomp;
    if ($_ =~ m/\>/) {      # looks for match to the '>' character in the header line
        # if match to fasta header, does nothing
    }
    elsif ($_ !~ m/\>/) {   # looks for non-match to the '>' character in the sequence line
        push @lengths, length($_);   # if fasta sequence, calculates and collects its length
    }
}
close IN1;

open(IN2, '<', $input2) or die "Can't read source file $input2 : $!\n";
while (<IN2>) {
    chomp;
    if ($_ =~ m/\>/) {      # looks for match to the '>' character in the header line
        push @source, $_;   # if match to fasta header, includes it in the array
    }
    elsif ($_ !~ m/\>/) {   # looks for non-match to the '>' character in the sequence line
        push @source, $_;   # if match to fasta sequence, includes the sequence in the array
    }
}
close IN2;

#********************#
# CALCULATE LENGTH DISTRIBUTION FROM INPUT FILE #1
#********************#

my @sorted = sort {$a <=> $b} @lengths;

my %seen = ();
my @uniques = grep { !$seen{$_}++ } @sorted;

foreach my $len (@uniques) {
    push @distrbtn, $len;
    my $index = 0;
    my $count = 0;
    while ($index <= $#sorted) {
        if ($len == $sorted[$index]) {
            $count++;
        }
        $index++;
    }
    push @distrbtn, $count;
}

# hash of predicted sORF length (key) and number of times (value) that size is
# observed in the multifasta input file #1
my %dstrbtn_hash = @distrbtn;

# print @distrbtn, "\n"; # works thus far

#********************#
# EXTRACT SEQUENCES (RANDOMLY) FROM INPUT2, WITH IDENTICAL LENGTH DISTRIBUTION OF INPUT 1
#********************#

my $header_count = 1;

foreach my $key (keys %dstrbtn_hash) {
    # the sORF size is calculated only for the coding region, with the stop codon
    # (3 nucleotides - TAA, TGA or TAG) removed, to obtain JUST the coding sequence's length
    my $size = $key - 3;
    # the number of times that a certain sORF size is seen in input file #1,
    # as calculated by the earlier portion of this Perl script
    my $freq = $dstrbtn_hash{$key};

    my $iteration = 1;
    while ($iteration <= $freq) {
        my ($temp_source_seq, $temp_source_seq_len);

        EXTRACT:
        # choose a random sequence ONLY if it is as long or longer
        # than the length of sequence that needs to be extracted out
        {
            my $chosen = int(rand($#source));
            $chosen++ if (($chosen % 2) == 0);
            # even-numbered array indices are fasta headers, and therefore not of interest;
            # instead, we are interested in the odd-numbered array indices for the fasta
            # sequence which precedes or follows the header
            $temp_source_seq     = $source[$chosen];
            $temp_source_seq_len = length($temp_source_seq);
            redo EXTRACT if ($temp_source_seq_len < $size);
        }

        START:
        # choose a random start coordinate ONLY if the substring starting at that position
        # falls within the end coord of the sequence that it is part of
        {
            my $random_start_coord = int(rand($temp_source_seq_len));
            redo START if (($random_start_coord + $size) > $temp_source_seq_len);
            my $extracted_seq = substr($temp_source_seq, $random_start_coord, $size);
            push @output, ">" . $header_count . "extracted_seq", "\n";
            push @output, $extracted_seq, "\n";
            $header_count++;
        }
        $iteration++;
    }
}

#********************#
# WRITE TO OUTPUT FILE, REPORT TIME TAKEN FOR CALCULATION
#********************#

my $filename = $input1 . $input2 . "_extracted_seqs.fasta";
open(OUT, '>', $filename) or die "Can't write to file $filename : $!\n";
print OUT @output;

my $end_time = time;
my $duration = ($end_time - $start_time) / 60;
print "Thank you for your patience, this Perl script operation has completed, it took $duration minutes, good bye!", " \n";
close OUT;
#********************#

Replies are listed 'Best First'.
Re: Speeding up stalled script
by BrowserUk (Patriarch) on Feb 03, 2015 at 03:52 UTC

    On initial inspection, I strongly suspect that the problem you describe with larger input files is caused by your profligate use of memory pushing your process into swapping, thus slowing it down by a factor of 1,000 or more.

    An example. In this loop:

    foreach my $len (@uniques) {
        push @distrbtn, $len;
        my $index = 0;
        my $count = 0;
        while ( $index <= $#sorted ) {
            if ( $len == $sorted[$index] ) {
                $count++;
            }
            $index++;
        }
        push @distrbtn, $count;
    }

    You build an array by alternately pushing the length, and then the count of sequences of that length.

    And then as soon as the loop finishes you convert that array into a hash:

    my %dstrbtn_hash = @distrbtn;

    And (AFAI saw), the array is never referenced again, but sticks around using memory for the rest of the script.
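    As a reminder of what that assignment does: an even-sized list assigned to a hash is paired up as key/value, so a hypothetical @distrbtn of (18, 5, 21, 2) becomes a hash mapping length 18 to a count of 5 and length 21 to a count of 2:

    my @distrbtn     = ( 18, 5, 21, 2 );   # hypothetical contents: length, count, length, count
    my %dstrbtn_hash = @distrbtn;          # same as ( 18 => 5, 21 => 2 )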

    Why build the array only to convert it into a hash? Why not just do:

    my %dstrbtn_hash;
    foreach my $len (@uniques) {
        my $index = 0;
        my $count = 0;
        while ( $index <= $#sorted ) {
            if ( $len == $sorted[$index] ) {
                $count++;
            }
            $index++;
        }
        $dstrbtn_hash{ $len } = $count;
    }
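    Going one step further, a minimal sketch: since all that is needed from file 1 is how often each length occurs, the sort/unique/count passes can be collapsed into a single counting loop over @lengths:

    my %dstrbtn_hash;
    $dstrbtn_hash{$_}++ for @lengths;   # length => number of times that length is seen

    That also removes the nested pass over @sorted for every unique length.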

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
Re: Speeding up stalled script
by GrandFather (Saint) on Feb 03, 2015 at 04:07 UTC

    There's a fair chunk of unconventional and sloppy code in there which I haven't time to comment on blow by blow, so instead here's the first chunk of the code cleaned up somewhat:

    use strict;
    use warnings;

    my $start_time = time;
    my ($input1, $input2) = @ARGV;

    open my $in, '<', $input1 or die "Can't read source file $input1 : $!\n";
    my @lengths = grep {!m/\>/} <$in>;
    close $in;
    chomp @lengths;
    @lengths = map { length } @lengths;   # convert the sequence lines into their lengths

    open $in, '<', $input2 or die "Can't read source file $input2 : $!\n";
    my @source = <$in>;
    close $in;
    chomp @source;

    #********************#
    # CALCULATE LENGTH DISTRIBUTION FROM INPUT FILE #1
    #********************#

    my @sorted = sort {$a <=> $b} @lengths;

    my %seen;
    my @uniques = grep {!$seen{$_}++} @sorted;

    # hash of predicted sORF length (key) and number of times (value) that size is
    # observed in the multifasta input file #1
    my %dstrbtn_hash;
    for my $len (@uniques) {
        $dstrbtn_hash{$len} = grep {$len == $_} @sorted;
    }

    which probably doesn't solve the problem, but maybe points you in the direction of better technique.

    I suspect the real issue is in the EXTRACT and START "loops". I suspect that depending on input data those loops could spend an indeterminately long time not achieving much. A small sample of your input data would help understand what's supposed to be going on there and find a more deterministic way of calculating the values you need.
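    As one example, a minimal sketch of a redo-free START step (variable names taken from the original script): once the EXTRACT block has guaranteed $temp_source_seq_len >= $size, there are exactly ($temp_source_seq_len - $size + 1) valid start positions, so one can be drawn directly instead of retried:

    # assumes $temp_source_seq_len >= $size, as the EXTRACT block ensures
    my $random_start_coord = int rand( $temp_source_seq_len - $size + 1 );
    my $extracted_seq      = substr( $temp_source_seq, $random_start_coord, $size );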

    Perl is the programming world's equivalent of English

      Thank you to all the Monks! I made the mods suggested by you, GrandFather, and by the Monks before you on this node. For the larger datasets there does not appear to be any difference. I am running one such job, and it's been 30 minutes already. As you point out, the problem is likely with my algorithm/logic/approach in the EXTRACT and START loops not scaling up well... Any thoughts on improving this scale-up for massive files? Some input file datasets can be found at http://bit.ly/1K69JuQ

      The command line syntax would be

      perl Length_dstrbtn_seq_extractor.pl Ath167_sORF.facleaned-up_ReMapped_v2-5p-flanking.fa Athaliana_167_TAIR10.cds_primaryTranscriptOnly.facleaned-up

      perl Length_dstrbtn_seq_extractor.pl Ath167_sORF.facleaned-up_ReMapped_v2-5p-flanking.fa Athaliana_167_intron_ONLY_FASTextract-intronic-seqs.fasta

      perl Length_dstrbtn_seq_extractor.pl Ath167_sORF.facleaned-up_ReMapped_v2-3p-flanking.fa Athaliana_167_TAIR10.cds_primaryTranscriptOnly.facleaned-up

      perl Length_dstrbtn_seq_extractor.pl Ath167_sORF.facleaned-up_ReMapped_v2-3p-flanking.fa Athaliana_167_intron_ONLY_FASTextract-intronic-seqs.fasta

      There are other datasets, some super small, and one set (still uploading) that is very, very large.

        As a general thing we would rather see you include the minimum data required along with your node. Linking to data elsewhere has the issue that the linked data may change, be removed or move. In this case your link seems to be broken.

        If you would like further help with this issue I suggest you mock up a very small data set (no more than a few dozen lines of text) for us to play with. It also helps if you can indicate the type of output expected from running the script against your sample data set so we know what we are aiming at.

        Perl is the programming world's equivalent of English
Re: Speeding up stalled script
by Athanasius (Archbishop) on Feb 03, 2015 at 03:29 UTC

    Hello onlyIDleft,

    Not an answer, just a couple of observations. This code (comments removed):

    while (<IN1>) {
        chomp;
        if ($_ =~ m/\>/) {
        }
        elsif ($_ !~ m/\>/) {
            push @lengths, length($_);
        }
    }

    can be replaced with the equivalent:

    while (<IN1>) {
        chomp;
        push @lengths, length $_ unless />/;
    }

    Update: No need to escape > in a regex.

    And similarly, this code (comments removed):

    while (<IN2>) {
        chomp;
        if ($_ =~ m/\>/) {
            push @source, $_;
        }
        elsif ($_ !~ m/\>/) {
            push @source, $_;
        }
    }

    is equivalent to:

    chomp, push @source, $_ while <IN2>;

    :-)

    Hope that helps,

    Athanasius <°(((>< contra mundum

Re: Speeding up stalled script
by graff (Chancellor) on Feb 03, 2015 at 07:38 UTC
    Taking BrowserUk's advice a little further, I think that relative to the OP code, you can replace lines 21-80 with the following (28 lines):
    open(IN1, '<', $input1) or die "Can't read source file $input1 : $!\n";
    my $minlength = 1<<20;
    my %distrbtn;
    while (<IN1>) {
        next if />/;
        chomp;
        my $len = length();
        $distrbtn{$len}++;
        $minlength = $len if ( $minlength > $len );
    }
    close IN1;
    $minlength -= 3;

    open(IN2, '<', $input2) or die "Can't read source file $input2 : $!\n";
    my $header;
    my @source;
    my %source_lengths;
    while (<IN2>) {
        chomp;
        if ( />/ ) {
            $header = $_;
        }
        elsif ( length() >= $minlength ) {
            push @source, $header;
            push @source, $_;
            push @{$source_lengths{length()}}, $#source;
        }
    }
    close IN2;
    (updated to fix how header lines are skipped in the first input file)

    That handles three issues: First, your "%distrbtn_hash" (declared in my code here as "%distrbtn") can be built directly while reading the first input file - this eliminates four redundant arrays and a lot of unnecessary code.

    Second, (I don't know whether this makes any difference regarding your second input file, but) the "@source" array doesn't need to store any strings that are shorter than the shortest line (minus 3) found in your first input file. (I'm assuming that you really want to keep the header strings with their associated data strings from file2.)

    Third, since your "EXTRACT" block seems to be trying to locate source strings of particular lengths for each of the string lengths found in the first input, it will make things a lot easier if you index the source strings according to their lengths - that is what the "%source_lengths" hash is doing.

    That way, as you loop over the lengths found in the first input file, you know exactly how many entries from the second file have a suitable length, and can choose from that set of sources randomly, and know exactly where to find each entry in the @source array according to its length.

    I don't understand your selection criteria well enough to finish that part of the code, but it might start with something like this:

    my $max_source_length = ( sort {$b<=>$a} keys %source_lengths )[0];

    for my $key ( sort {$a<=>$b} keys %distrbtn ) {
        my $size = $key - 3;
        my $freq = $distrbtn{$key};

        # find the first set of source strings of equal or greater size:
        my $source_key = $key;
        while ( $source_key <= $max_source_length and
                not exists( $source_lengths{$source_key} )) {
            $source_key++;
        }
        if ( $source_key > $max_source_length ) {
            die "We can't do this: strings from $input2 aren't long enough";
        }
        ...
    }
    The point about that last "if" condition is that it's not clear to me that the "input2" data will necessarily satisfy all the selection criteria.

    UPDATE: In case you're confused or unsure about using a hash of arrays, the next snippet (which could be placed after the last "if" condition above) might help clarify:

    my @usable_sources = @{$source_lengths{$source_key}};
    printf "for an input1 string of length %d, we can choose from %d input2 strings\n",
        $size, scalar @usable_sources;

    # in case we want to add more sources that happen to be longer:
    $source_key++;
    while ( $source_key <= $max_source_length and scalar @usable_sources < $freq ) {
        push @usable_sources, @{$source_lengths{$source_key}}
            if exists( $source_lengths{$source_key} );
        $source_key++;
    }
    if ( $freq > @usable_sources ) {
        warn "We ran short of desired frequency for length $size\n";
    }
    elsif ( $freq < @usable_sources ) {
        # do something to randomly remove items from @usable_sources...
    }
    for my $offset ( @usable_sources ) {
        my $header = $source[$offset-1];
        my $string = $source[$offset];
        # do whatever...
    }
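    For the "randomly remove items from @usable_sources" placeholder, a minimal (equally untested) sketch would be a partial Fisher-Yates shuffle that keeps exactly $freq randomly chosen offsets:

    if ( $freq < @usable_sources ) {
        for my $i ( 0 .. $freq - 1 ) {
            my $j = $i + int rand( @usable_sources - $i );
            @usable_sources[ $i, $j ] = @usable_sources[ $j, $i ];   # swap into place
        }
        splice @usable_sources, $freq;   # keep only the first $freq shuffled offsets
    }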
    Can't really say anything more unless you can show us some sample data from each input, with some desired outputs (that is what GrandFather was asking for - not command-line syntax). It would also be helpful to have some statistics about the "really large file": if page faults are a problem (because the data and index info is larger than available RAM), there are other ways to index into a large data file without using huge in-memory hashes and arrays (and hashes of arrays).

    (please note that none of the above is tested; also: updated to spell BrowserUK's name correctly)

    (And one last update to add a missing close-curly in the last snippet - which is still untested.)

Re: Speeding up stalled script
by hdb (Monsignor) on Feb 03, 2015 at 08:50 UTC

    Having integrated comments from fellow monks I would suggest four additional changes:

    1. In the second file, compute and store the length of the sequence alongside the sequence. Ignore headers.
    2. When picking a random sequence first filter on the required length. This might slow things down if sequences in file 1 are much shorter than in file 2 but give it a try.
    3. Subtract the desired size from the length before determining the starting point of the sequence.
    4. Print the sequences into the output file directly instead of first storing them in an array. This saves another array.
    I have no time now to test my code, so apologies for any mistakes...

    use strict;
    use warnings;

    my $start_time = time;

    my $input1 = shift @ARGV;
    my $input2 = shift @ARGV;

    my %dstrbtn_hash;
    open(IN1, '<', $input1) or die "Can't read source file $input1 : $!\n";
    while (<IN1>) {
        chomp;
        $dstrbtn_hash{ length($_) }++ unless />/;
    }
    close IN1;

    open(IN2, '<', $input2) or die "Can't read source file $input2 : $!\n";
    my @source;
    while (<IN2>) {                        # ignore header
        my $seq = <IN2>;
        chomp $seq;
        my $len = length( $seq );
        push @source, [ $seq, $len ];      # keep length alongside the sequence
    }
    close IN2;

    my $filename = $input1.$input2."_extracted_seqs.fasta";
    open (OUT, '>', $filename) or die "Can't write to file $filename : $!\n";

    my $header_count = 1;
    while ( my ($key, $freq) = each %dstrbtn_hash ) {
        my $size = $key - 3;
        for my $iteration (1..$freq) {
            my $temp_source_seq;
            my @cand;
            EXTRACT: {
                # filter sequences long enough
                @cand = map { $_->[0] } grep { $_->[1] >= $size } @source;
                die "No long enough sequence found in $input2\n" unless @cand;
                my $chosen = int(rand(@cand));
                $temp_source_seq = $cand[$chosen];
            }
            START: {
                my $temp_source_seq_len = length ($temp_source_seq);
                # subtract $size to avoid loop
                my $random_start_coord = int(rand($temp_source_seq_len - $size));
                my $extracted_seq = substr($temp_source_seq, $random_start_coord, $size);
                print OUT ">".$header_count."extracted_seq", "\n";
                print OUT $extracted_seq, "\n";
                $header_count++;
            }
        }
    }
    close OUT;

    my $end_time = time;
    my $duration = ($end_time - $start_time)/60;
    print "Thank you for your patience, this Perl script operation has completed, it took $duration minutes, good bye!", " \n";

    UPDATE: Couple of changes pointed out by karlgoethebier. Thanks!

      Thank you to all the Monks for your valuable edits / modifications / complete re-writes. I learned some things about reducing memory usage, getting rid of useless variables, etc. Appreciate it!

      Since none of your suggestions were working either, I wrote a short script to see the length distribution explicitly, and there were sequences from input file #1 that were 0 (zero) in length as well as longer than the longest sequence in input file #2. This will cause my script to search without terminating. This was also confirmed by hdb's script, because it exits with an error message after error-checking.
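      For the record, a minimal sketch of the kind of sanity check (using the variable names from my original script) that would have caught this up front:

      # after building %dstrbtn_hash from file 1 and @source from file 2
      my $max_source_len = 0;
      for my $i ( grep { $_ % 2 } 0 .. $#source ) {    # odd indices hold the sequences
          my $len = length $source[$i];
          $max_source_len = $len if $len > $max_source_len;
      }
      for my $key ( sort { $a <=> $b } keys %dstrbtn_hash ) {
          die "Zero-length sequence found in input file 1\n" if $key == 0;
          die "Length $key exceeds the longest sequence in input file 2 ($max_source_len)\n"
              if $key - 3 > $max_source_len;
      }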

      Anti-climactically, errors in the input files being fed to me were the reason my previously working script stopped working. In the end, a good lesson for me, and useful to get pointers about scripting I wouldn't have otherwise gathered from you all. So thank you all, again!

Re: Speeding up stalled script
by xiaoyafeng (Deacon) on Feb 04, 2015 at 08:44 UTC
    Considering your background, I think the best and quickest way to speed up your script is to use MCE, replacing your while loops with MCE::Loop:
    use MCE::Loop;

    mce_loop_f { chomp; do_work($_) } "/path/to/file";




    I am trying to improve my English skills, if you see a mistake please feel free to reply or /msg me a correction

Re: Speeding up stalled script
by CoVAX (Beadle) on Feb 06, 2015 at 08:18 UTC

    This thread caused me to search for "perl debugger" in the hope that I could gain access to memory usage information, etc. But the version of perl I'm using (ActivePerl for Windows) wasn't built with -DDEBUGGING.

    Anyway I did discover Devel::Size ( http://search.cpan.org/search?query=devel%3A%3Asize&mode=module ) for finding the memory usage of Perl variables. Maybe the OP will find it useful for comparing alternatives.
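    For example, a minimal sketch of how it might be applied to the @source array from the OP's script (assuming Devel::Size is installed):

    use Devel::Size qw(size total_size);

    # size() reports the array structure itself; total_size() follows the
    # contained scalars as well, so it reflects the real memory footprint.
    print size( \@source ),       " bytes (structure only)\n";
    print total_size( \@source ), " bytes (including contents)\n";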

    I also learned about "Debugging Perl Memory Usage" ( http://perldoc.perl.org/perldebguts.html#Debugging-Perl-Memory-Usage )

Re: Speeding up stalled script
by marioroy (Prior) on Feb 13, 2015 at 05:12 UTC

    Below is a version using the MCE::Flow model, for ease of passing MCE options and a code block. This will increase performance for $input1 by 2.5x or more.

    use MCE::Flow;

    my $input1 = shift @ARGV;
    my @lengths;

    mce_flow_f {
        chunk_size  => '2m',
        max_workers => 'auto',
        use_slurpio => 1,
        gather      => \@lengths
    }, sub {
        my ($mce, $slurp_ref, $chunk_id) = @_;
        my (@chunk_lengths);

        open my $IN1, '<', $slurp_ref;
        while (<$IN1>) {
            chomp;
            push @chunk_lengths, length $_ unless />/;
        }
        close $IN1;

        MCE->gather(@chunk_lengths);
    }, $input1;

    MCE::Flow::finish;

    print scalar @lengths, "\n";
Re: Speeding up stalled script
by marioroy (Prior) on Feb 13, 2015 at 05:33 UTC

    The following is another way. The block receives an array reference. Between the two, slurping is faster: 2.023s versus 2.904s for a FASTA file containing 13.7 million lines excluding headers.

    use MCE::Flow;

    my $input1 = shift @ARGV;
    my @lengths;

    mce_flow_f {
        chunk_size  => '2m',
        max_workers => 'auto',
        gather      => \@lengths
    }, sub {
        my ($mce, $chunk_ref, $chunk_id) = @_;
        my (@chunk_lengths);

        chomp @{ $chunk_ref };

        foreach (@{ $chunk_ref }) {
            push @chunk_lengths, length $_ unless />/;
        }

        MCE->gather(@chunk_lengths);
    }, $input1;

    MCE::Flow::finish;

    print scalar @lengths, "\n";
Re: Speeding up stalled script
by sundialsvc4 (Abbot) on Feb 03, 2015 at 17:28 UTC

    While we’re on this subject, here’s something I’d really like to know:   how memory-hungry is a line like this . . .

    foreach my $key (keys %dstrbtn_hash) {

    ... versus, say, an iterator, e.g. each() applied to the same hash?

    Is Perl going to build an in-memory anonymous array of all those keys, in order to subsequently foreach through it?

    I’ve no doubt that the root cause of this problem is virtual-memory thrashing.   But, is a statement like this one a “hidden” source of more memory consumption?

      Hello sundialsvc4,

      Update: My conclusion was wrong, keys does consume more memory than each. See choroba’s comment and my reply, below.


      Is Perl going to build an in-memory anonymous array of all those keys, in order to subsequently foreach through it?

      No, Perl uses the same internal iterator in both cases:

      So long as a given hash is unmodified you may rely on keys, values and each to repeatedly return the same order as each other.
      ...
      Each hash or array has its own internal iterator, accessed by each, keys, and values.
      — quoted from each, but see also keys and values.

      Note that this internal iterator (one per hash) is implicitly reset when successive calls to each (or keys, or values) have exhausted the available hash entries. But it can also be reset explicitly, for which the recommended method is to call keys %hash in void context (see values).
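      For comparison, a minimal sketch of the each-based loop the question contrasts with keys: it processes one key/value pair at a time and never builds the full key list (hdb's rewrite above already uses this form):

      while ( my ($key, $freq) = each %dstrbtn_hash ) {
          my $size = $key - 3;
          # process one length at a time; no list of all keys is built in memory
      }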

      Hope that helps,

      Athanasius <°(((>< contra mundum

        The iterator is used by keys, but only to construct the list of keys up front, not on each iteration of the for loop. Otherwise, the following program wouldn't output
        $VAR1 = { 'x' => 'cabd' };

        The value would be shorter, as there would be fewer keys after the second iteration.

        #!/usr/bin/perl
        use warnings;
        use strict;

        use Data::Dumper;

        my %hash = ( a => 11,
                     b => 12,
                     c => 13,
                     d => 14,
                   );

        for my $key (keys %hash) {
            delete $hash{$_} for qw( a b c d );
            $hash{'x'} .= $key;
        }
        print Dumper \%hash;
        لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
