Re: Speeding up stalled script

There's a fair chunk of unconventional and sloppy code in there which I haven't time to comment on blow by blow, so instead here's the first chunk of the code cleaned up somewhat:

use strict;
use warnings;

my $start_time = time;
my ($input1, $input2) = @ARGV;

open my $in, '<', $input1 or die "Can't read source file $input1 : $!\n";

my @lengths = grep{! m/\>/} <$in>;
close $in;
chomp @lengths;

open $in, '<', $input2 or die "Can't read source file $input2 : $!\n";
my @source = <$in>;
close $in;
chomp @source;

#********************#
# CALCULATE LENGTH DISTRIBUTION FROM INPUT FILE #1
#********************#
my @sorted = sort {$a <=> $b} @lengths;
my %seen;
my @uniques = grep {!$seen{$_}++} @sorted;

# hash of predicted sORF length (key) and number of times (value) that size is
# observed in the multifasta input file #1
my %dstrbtn_hash;

for my $len (@uniques) {
    # grep in scalar context returns the number of matching elements
    $dstrbtn_hash{$len} = grep{$len == $_} @sorted;
}

which probably doesn't solve the problem, but maybe points you in the direction of better technique.
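
One aside on the length distribution step: the loop at the end of that code rescans @sorted once for every unique length, which gets slow when there are many distinct lengths. A minimal sketch of a single-pass alternative, assuming all you need are the counts and the sorted unique lengths (the sample values here are made up, not taken from your files):

```perl
use strict;
use warnings;

# Hypothetical sample lengths standing in for the values read from input file #1.
my @lengths = (30, 12, 30, 45, 12, 30);

# Count every length in one pass over the data.
my %dstrbtn_hash;
$dstrbtn_hash{$_}++ for @lengths;

# The unique lengths, in numeric order, are just the sorted hash keys.
my @uniques = sort { $a <=> $b } keys %dstrbtn_hash;

# Print each length and how often it was seen, tab separated.
printf "%d\t%d\n", $_, $dstrbtn_hash{$_} for @uniques;
```

This does the same job as the sort, %seen filter and grep loop together, but each input value is only touched once. It is unlikely to be the cause of the stall on its own.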

I suspect the real issue is in the EXTRACT and START "loops": depending on the input data, they could spend an indeterminately long time not achieving much. A small sample of your input data would help us understand what's supposed to be going on there and find a more deterministic way of calculating the values you need.

Perl is the programming world's equivalent of English


Re^2: Speeding up stalled script by Anonymous Monk on Feb 03, 2015 at 06:47 UTC
Thank you to all the Monks! I made the mods suggested by you, GrandFather, and by the Monks before you on this node. For the larger datasets there does not appear to be any difference: I am running one such job, and it's been 30 minutes already. As you point out, the problem is likely that my algorithm/logic/approach in the EXTRACT and START loops does not scale up well. Any thoughts on improving this scale-up for massive files?

Some input file datasets can be found at http://bit.ly/1K69JuQ

The command line syntax would be:

perl Length_dstrbtn_seq_extractor.pl Ath167_sORF.facleaned-up_ReMapped_v2-5p-flanking.fa Athaliana_167_TAIR10.cds_primaryTranscriptOnly.facleaned-up
perl Length_dstrbtn_seq_extractor.pl Ath167_sORF.facleaned-up_ReMapped_v2-5p-flanking.fa Athaliana_167_intron_ONLY_FASTextract-intronic-seqs.fasta
perl Length_dstrbtn_seq_extractor.pl Ath167_sORF.facleaned-up_ReMapped_v2-3p-flanking.fa Athaliana_167_TAIR10.cds_primaryTranscriptOnly.facleaned-up
perl Length_dstrbtn_seq_extractor.pl Ath167_sORF.facleaned-up_ReMapped_v2-3p-flanking.fa Athaliana_167_intron_ONLY_FASTextract-intronic-seqs.fasta

There are other datasets, some super small, and one set that is still uploading that is very, very large.
Re^3: Speeding up stalled script by GrandFather (Saint) on Feb 04, 2015 at 05:43 UTC
As a general thing we would rather see you include the minimum data required along with your node. Linking to data elsewhere has the issue that the linked data may change, be removed or move. In this case your link seems to be broken.

If you would like further help with this issue, I suggest you mock up a very small data set (no more than a few dozen lines of text) for us to play with. It also helps if you can indicate the type of output expected from running the script against your sample data set, so we know what we are aiming at.

