Dear Monks,
I seek your help in identifying the best and fastest algorithm for my bioinformatics problem. Please allow me to explain (got deleted when I edited my original post)
For every run, I have 2 input files, each with up to a few million DNA sequences (sometimes much fewer). From input file 1, I will infer the length distribution of all sequences i.e. how many times a certain length is observed. GOAL: To extract from input file 2, as many DNA sequences and with an identical length distribution as in input file 1. The Perl script I have written is quick ONLY for small input file pairs. IT does not scale up at all for large inputs it seems. That is the help I seek from you. Thank you very much!
UPDATE1: Script edited to improve readability
UPDATE2: I checked my script, and I have to admit, I do not see changes / errors compared to the previous time that I ran it. So the question is how can I modify it to make it much faster, because I can only imagine that the large input file sizes need to be compensated for by a speedier version. So modifications in logic / syntax etc will be MOST welcome. Thank you!
#! usr/bin/perl # Length_dstrbtn_seq_extractor.pl # This PERL script accepts two input files - the first input is a mult +ifasta file containing alternating lines of fasta headers and sequenc +es generated using using the Perl script fix_multifasta0.pl. # This Perl scripts calculates the frequency distribution of sequences + of all sizes in this input file. # The second input file is the user chosen multifasta sequence file fr +om which sequences of the same frequency distribution of lengths will + be extracted. # The only output is multifasta sequence file. The fasta headers speci +fy source sequence, start coordinate and stop coordinate. # Syntax : perl Length_dstrbtn_seq_extractor.pl <input1> <input2-sourc +e4extraction> #********************# #PROCESS INPUT FILES #********************# use strict; use warnings; my $start_time = time; my $input1 = shift @ARGV; my $input2 = shift @ARGV; my (@lengths, @source, @distrbtn, @output); open(IN1, '<', $input1) or die "Can't read source file $input1 : $!\n" +; while(<IN1>) { chomp; if ($_=~ m/\>/) { #looks for match to the '>' character in the header line # if match to fasta header, does nothing } elsif ($_!~ m/\>/) { #looks for non-match to the '>' character in the sequence line push @lengths, length($_); # if match to fasta sequence, calculates and collects length of sequen +ce } } close IN1; open(IN2, '<', $input2) or die "Can't read source file $input2 : $!\n" +; while(<IN2>) { chomp; if ($_=~ m/\>/) { #looks for match to the '>' character in the header line push @source, $_; # if match to fasta header, includes in array } elsif ($_!~ m/\>/) { #looks for non-match to the '>' character in the sequence line push @source, $_; # if match to fasta sequence, includes sequence in array } } close IN2; #********************# # CALCULATE LENGTH DISTRIBUTION FROM INPUT FILE #1 #********************# my @sorted = sort {$a <=> $b}@lengths; my %seen = (); my @uniques = grep { !$seen{$_}++ } @sorted; foreach my $len(@uniques) { push @distrbtn, $len; my $index = 0; my $count = 0; while($index <= $#sorted) { if($len == $sorted[$index]) { $count++; } $index++; } push @distrbtn, $count; } my %dstrbtn_hash = @distrbtn; # hash of predicted sORF length (key) an +d number of times (value) that size is observed in the multifasta inp +ut file #1 # print @distrbtn, "\n"; # works thus far #********************# # EXTRACT SEQUENCES (RANDOMLY) FROM INPUT2, WITH IDENTICAL LENGTH DIST +RIBUTION OF INPUT 1 #********************# my $header_count = 1; foreach my $key (keys %dstrbtn_hash) { my $size = $key - 3; # the sORF size is calculated only for coding reg +ion with the stop codon length (3 nucleotides - TAA, TGA or TAG) remo +ved to obtain JUST the coding sequence's length my $freq = $dstrbtn_hash{$key}; # the number of times that a certain s +ORF size is seen in the input file #1, as calculated by the earlier p +ortion of this Perl script my $iteration = 1; while ($iteration <= $freq) { my ($temp_source_seq, $temp_source_seq_len); EXTRACT: # choose a random sequence ONLY if it is as long or longer # than the length of sequence that needs to be extracted out { my $chosen=int(rand($#source)); $chosen++ if(($chosen%2) == 0); # even numbered array index is for fasta header, and therefore not of +interest, # instead, we are interested in odd numbered array index number for th +e fasta sequence # which precedes or follows the header $temp_source_seq = $source[$chosen]; $temp_source_seq_len = length ($temp_source_seq); redo EXTRACT if($temp_source_seq_len < $size); } START: # choose a random start coordinate ONLY if the substring starting at t +hat position # falls within the end coord of the sequence that it is part of { my $random_start_coord = int(rand($temp_source_seq_len +)); redo START if(($random_start_coord + $size) > $tem +p_source_seq_len); my $extracted_seq = substr($temp_source_seq, $random_start +_coord, $size); push @output,">".$header_count."extracted_seq" +, "\n"; push @output, $extracted_seq, "\n"; $header_count++; } $iteration++; } } #********************# # WRITE TO OUTPUT FILE, REPORT TIME TAKEN FOR CALCULATION #********************# my $filename = $input1.$input2."_extracted_seqs.fasta"; open (OUT, '>', $filename) or die "Can't write to file $filename : + $!\n"; print OUT @output; my $end_time = time; my $duration = ($end_time - $start_time)/60; print "Thank you for your patience, this Perl script operation has com +pleted, it took $duration minutes, good bye!", " \n"; close OUT; #********************#
In reply to Speeding up stalled script by onlyIDleft
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |