PerlMonks  

Re: Speeding up stalled script

by GrandFather (Saint)
on Feb 03, 2015 at 04:07 UTC


in reply to Speeding up stalled script

There's a fair chunk of unconventional and sloppy code in there which I haven't time to comment on blow by blow, so instead here's the first chunk of the code cleaned up somewhat:

    use strict;
    use warnings;

    my $start_time = time;
    my ($input1, $input2) = @ARGV;

    open my $in, '<', $input1 or die "Can't read source file $input1 : $!\n";
    my @lengths = grep {! m/\>/} <$in>;
    close $in;
    chomp @lengths;

    open $in, '<', $input2 or die "Can't read source file $input2 : $!\n";
    my @source = <$in>;
    close $in;
    chomp @source;

    #********************#
    # CALCULATE LENGTH DISTRIBUTION FROM INPUT FILE #1
    #********************#
    my @sorted = sort {$a <=> $b} @lengths;
    my %seen;
    my @uniques = grep {!$seen{$_}++} @sorted;

    # hash of predicted sORF length (key) and number of times (value) that size is
    # observed in the multifasta input file #1
    my %dstrbtn_hash;

    for my $len (@uniques) {
        $dstrbtn_hash{$len} = grep {$len == $_} @sorted;
    }

which probably doesn't solve the problem, but maybe points you in the direction of better technique.
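As an aside, the counting loop above rescans @sorted once per unique value. A minimal sketch of a single-pass alternative follows. The sub name length_distribution and the sample values are made up for illustration, and this is only one possible approach, not a tested replacement for the original script:

```perl
use strict;
use warnings;

# Count how often each value occurs in a single pass over the list.
# (length_distribution is a hypothetical helper name, not from the original script.)
sub length_distribution {
    my @lengths = @_;
    my %dstrbtn_hash;
    $dstrbtn_hash{$_}++ for @lengths;
    return %dstrbtn_hash;
}

# Hypothetical sample values, for illustration only.
my %dist = length_distribution(30, 45, 30, 60, 45, 30);
for my $len (sort { $a <=> $b } keys %dist) {
    print "$len\t$dist{$len}\n";
}
```

Note that hash keys are compared as strings, whereas the grep version compares with ==, so values such as "30" and "30.0" would be counted separately here.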

I suspect the real issue is in the EXTRACT and START "loops": depending on the input data, they could spend an indeterminately long time not achieving much. A small sample of your input data would help us understand what's supposed to be going on there and find a more deterministic way of calculating the values you need.
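One way to check that suspicion before rewriting anything is to time each stage separately. The sketch below uses Time::HiRes from the core distribution. The helper name timed and the dummy stage are hypothetical; in the real script the EXTRACT and START loops would be wrapped instead:

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Run a code reference, report how long it took on STDERR,
# and return the elapsed time in seconds.
# (timed is a hypothetical helper name, not from the original script.)
sub timed {
    my ($label, $code) = @_;
    my $start = time;
    $code->();
    my $elapsed = time - $start;
    printf STDERR "%s took %.3f s\n", $label, $elapsed;
    return $elapsed;
}

# Hypothetical stand-in for one stage of the script.
timed('dummy stage', sub { my $sum = 0; $sum += $_ for 1 .. 100_000; });
```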

Perl is the programming world's equivalent of English

Re^2: Speeding up stalled script
by Anonymous Monk on Feb 03, 2015 at 06:47 UTC

    Thank you to all the Monks! I made the mods suggested by you, GrandFather, and by the Monks before you on this node. For the larger datasets there does not appear to be any difference. I am running one such job, and it's been 30 minutes already. As you point out, the problem is likely that my algorithm/logic/approach in the EXTRACT and START loops does not scale up well. Any thoughts on improving this scale-up for massive files? Some input file datasets can be found at http://bit.ly/1K69JuQ

    The command line syntax would be

    perl Length_dstrbtn_seq_extractor.pl Ath167_sORF.facleaned-up_ReMapped_v2-5p-flanking.fa Athaliana_167_TAIR10.cds_primaryTranscriptOnly.facleaned-up

    perl Length_dstrbtn_seq_extractor.pl Ath167_sORF.facleaned-up_ReMapped_v2-5p-flanking.fa Athaliana_167_intron_ONLY_FASTextract-intronic-seqs.fasta

    perl Length_dstrbtn_seq_extractor.pl Ath167_sORF.facleaned-up_ReMapped_v2-3p-flanking.fa Athaliana_167_TAIR10.cds_primaryTranscriptOnly.facleaned-up

    perl Length_dstrbtn_seq_extractor.pl Ath167_sORF.facleaned-up_ReMapped_v2-3p-flanking.fa Athaliana_167_intron_ONLY_FASTextract-intronic-seqs.fasta

    There are other datasets, some super small, and one set that is still uploading that is very, very large.

      As a general thing we would rather see you include the minimum data required to demonstrate the problem along with your node. Linking to data elsewhere has the issue that the linked data may change, be removed or move. In this case your link seems to be broken.

      If you would like further help with this issue I suggest you mock up a very small data set (no more than a few dozen lines of text) for us to play with. It also helps if you can indicate the type of output expected from running the script against your sample data set so we know what we are aiming at.
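For what it's worth, a small sample can be cut from a large FASTA file with a few lines of Perl. This is only a sketch: the sub name first_fasta_records, the record count of 10 and the script name make_sample.pl are made up, and it assumes every record starts with a line beginning with '>':

```perl
use strict;
use warnings;

# Return the lines of the first $max_records FASTA records read from $fh.
# A record is assumed to start at a line beginning with '>'.
# (first_fasta_records is a hypothetical helper name.)
sub first_fasta_records {
    my ($fh, $max_records) = @_;
    my @sample;
    my $records = 0;
    while (my $line = <$fh>) {
        if ($line =~ /^>/) {
            last if ++$records > $max_records;   # stop once enough records are seen
        }
        push @sample, $line if $records;         # skip anything before the first header
    }
    return @sample;
}

# Hypothetical usage: perl make_sample.pl big.fa > small.fa
if (@ARGV) {
    my ($file) = @ARGV;
    open my $in, '<', $file or die "Can't read source file $file: $!\n";
    print first_fasta_records($in, 10);
    close $in;
}
```

Because it reads line by line and stops early, it does not need to load the whole input into memory.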

      Perl is the programming world's equivalent of English
