comment on

Dear Monks,

I seek your help in identifying the best and fastest algorithm for my bioinformatics problem. Please allow me to explain (got deleted when I edited my original post)

For every run, I have 2 input files, each with up to a few million DNA sequences (sometimes much fewer). From input file 1, I will infer the length distribution of all sequences i.e. how many times a certain length is observed. GOAL: To extract from input file 2, as many DNA sequences and with an identical length distribution as in input file 1. The Perl script I have written is quick ONLY for small input file pairs. IT does not scale up at all for large inputs it seems. That is the help I seek from you. Thank you very much!

UPDATE1: Script edited to improve readability

UPDATE2: I checked my script, and I have to admit, I do not see changes / errors compared to the previous time that I ran it. So the question is how can I modify it to make it much faster, because I can only imagine that the large input file sizes need to be compensated for by a speedier version. So modifications in logic / syntax etc will be MOST welcome. Thank you!

#! usr/bin/perl
# Length_dstrbtn_seq_extractor.pl

# This PERL script accepts two input files - the first input is a mult
+ifasta file containing alternating lines of fasta headers and sequenc
+es generated using using the Perl script fix_multifasta0.pl.
# This Perl scripts calculates the frequency distribution of sequences
+ of all sizes in this input file.
# The second input file is the user chosen multifasta sequence file fr
+om which sequences of the same frequency distribution of lengths will
+ be extracted.
# The only output is multifasta sequence file. The fasta headers speci
+fy source sequence, start coordinate and stop coordinate.

# Syntax : perl Length_dstrbtn_seq_extractor.pl <input1> <input2-sourc
+e4extraction>

#********************#
#PROCESS INPUT FILES
#********************#
use strict;
use warnings;

my $start_time = time;

my $input1 = shift @ARGV;
my $input2 = shift @ARGV;
my (@lengths, @source, @distrbtn, @output);

open(IN1, '<', $input1) or die "Can't read source file $input1 : $!\n"
+;

while(<IN1>)        {
chomp;    
    if ($_=~ m/\>/)        {                        
#looks for match to the '>' character in the header line
# if match to fasta header, does nothing
        }
    elsif ($_!~ m/\>/)        {                        
#looks for non-match to the '>' character in the sequence line
    push @lengths, length($_);                
# if match to fasta sequence, calculates and collects length of sequen
+ce
        }
}
close IN1;


open(IN2, '<', $input2) or die "Can't read source file $input2 : $!\n"
+;
while(<IN2>)        {
chomp;    
    if ($_=~ m/\>/)        {                        
#looks for match to the '>' character in the header line
    push @source, $_;                                                
# if match to fasta header, includes in array
        }
    elsif ($_!~ m/\>/)        {                        
#looks for non-match to the '>' character in the sequence line
    push @source, $_;                
# if match to fasta sequence, includes sequence in array
        }
}
close IN2;

#********************#
# CALCULATE LENGTH DISTRIBUTION FROM INPUT FILE #1
#********************#
my @sorted = sort {$a <=> $b}@lengths;

my %seen = ();
my @uniques = grep { !$seen{$_}++ } @sorted;

foreach my $len(@uniques)        {
push @distrbtn, $len;
my $index = 0;
my $count = 0;
    while($index <= $#sorted)        {
        if($len == $sorted[$index])
        {
        $count++;
        }
        $index++;
        }
    push @distrbtn, $count;
}

my %dstrbtn_hash = @distrbtn; # hash of predicted sORF length (key) an
+d number of times (value) that size is observed in the multifasta inp
+ut file #1
# print @distrbtn, "\n"; # works thus far

#********************#
# EXTRACT SEQUENCES (RANDOMLY) FROM INPUT2, WITH IDENTICAL LENGTH DIST
+RIBUTION OF INPUT 1
#********************#

my $header_count = 1;
foreach my $key (keys %dstrbtn_hash) {

my $size = $key - 3; # the sORF size is calculated only for coding reg
+ion with the stop codon length (3 nucleotides - TAA, TGA or TAG) remo
+ved to obtain JUST the coding sequence's length
my $freq = $dstrbtn_hash{$key}; # the number of times that a certain s
+ORF size is seen in the input file #1, as calculated by the earlier p
+ortion of this Perl script

my $iteration = 1;
while ($iteration <= $freq) {
    
    my ($temp_source_seq, $temp_source_seq_len);
    
EXTRACT: 
# choose a random sequence ONLY if it is as long or longer 
# than the length of sequence that needs to be extracted out

    {
        my $chosen=int(rand($#source));
        $chosen++ if(($chosen%2) == 0);
# even numbered array index is for fasta header, and therefore not of 
+interest, 
# instead, we are interested in odd numbered array index number for th
+e fasta sequence 
# which precedes or follows the header
        
        $temp_source_seq = $source[$chosen];
        $temp_source_seq_len = length ($temp_source_seq);
        
        redo EXTRACT if($temp_source_seq_len < $size); 
    }

    
        START:
# choose a random start coordinate ONLY if the substring starting at t
+hat position
# falls within the end coord of the sequence that it is part of

            {
                my $random_start_coord = int(rand($temp_source_seq_len
+));
                
                    redo START if(($random_start_coord + $size) > $tem
+p_source_seq_len);
                
            my $extracted_seq = substr($temp_source_seq, $random_start
+_coord, $size);
                        push @output,">".$header_count."extracted_seq"
+, "\n";
                        push @output, $extracted_seq, "\n";
                        $header_count++;
            }

    $iteration++;
        }
}

#********************#
#             WRITE TO OUTPUT FILE, REPORT TIME TAKEN FOR CALCULATION
#********************#

    my $filename = $input1.$input2."_extracted_seqs.fasta";
    open (OUT, '>', $filename) or die "Can't write to file $filename :
+ $!\n";
    print OUT @output;
my $end_time = time;
my $duration = ($end_time - $start_time)/60;
print "Thank you for your patience, this Perl script operation has com
+pleted, it took $duration minutes, good bye!", " \n";
    close OUT;

#********************#
[download]

In reply to Speeding up stalled script by onlyIDleft

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.