How can I improve my inefficient codes?

FluffyBunny has asked for the wisdom of the Perl Monks concerning the following question:

Hello perlmonks,

I wrote a basic code that can handle some text manipulation. Basically my first input (Mock3_SNP.txt) contains the exact location of the text, and I am trying to find what the text is from Mock3_forSNP.bow. the latter bowfile contains start position and 36 letters (if you understand some basic Biology, it will be chromosome number and starting position and it gives 36 bases -left to right). I succeeded with my small mock files to do the work, but with 20million lines of bowfile wouldnt do this in a few hour. My theoretical running time is actually weeks or months!! Could you guys take a look at my code and help me improve it? Thank you so much for reading!

Mock3_SNP.txt (first input)

Name    Chr    Position
rs1    1    35251
rs2    1    35248
rs3    1    37514
rs4    1    21145
rs5    2    35150
rs6    3    11412
rs7    4    7721
rs8    21    42516
rs9    4    1110
rs10    4    1145
[download]

Mock3_forSNP.bow - second file

PGX_SOLEXA1:2:1:5:804#0/1    -    chr1    35245    GGGAGGTAAAAAAAAAAAA
+AAAAAAAAAAAAAAAAA    BB^Yaa^a\a\D\aaaaa`V`a``abbaabaabaaa    0    "24
+:A>N,34:T>N"
PGX_SOLEXA1:2:4:5:804#0/1    +    chr1    37510    TTTTCTTTTTTTTTTTTTT
+TTTTTTTTTTTTTTTTT    BB^Yaa^a\a\D\aaaaa`V`a``abbaabaabaaa    0    "24
+:A>N,34:T>N"
PGX_SOLEXA1:4:4:5:804#0/1    -    chr1    21140    CCCCCGCCCCCCCCCCCCC
+CCCCCCCCCCCCCCCCC    BB^Yaa^a\a\D\aaaaa`V`a``abbaabaabaaa    0    "24
+:A>N,34:T>N"
PGX_SOLEXA1:4:4:5:774#0/1    -    chr2    35128    AAAAAAAAAAAAAAAAAAA
+AAATAAAAAAAAAAAAA    BB^Yaa^a\a\D\aaaaa`V`a``abbaabaabaaa    0    "24
+:A>N,34:T>N"
PGX_SOLEXA1:4:8:5:474#0/1    +    chr3    11410    CCGCCCCCCCCCCCCCCCC
+CCCCCCCCCCCCCCCCC    BB^Yaa^a\a\D\aaaaa`V`a``abbaabaabaaa    21    "2
+4:A>N,34:T>N"
PGX_SOLEXA1:4:8:5:314#0/1    +    chr5    7720    AGAAAAAAAAAAAAAAAAAA
+AAAAAAAAAAAAAAAA    BB^Yaa^a\a\D\aaaaa`V`a``abbaabaabaaa    0    "24:
+A>N,34:T>N"
PGX_SOLEXA1:15:8:5:314#0/1    -    chr21    42510    GGGGGGTCCCCCCCCCC
+CCCCCCCCCCCCCCCCCCC    BB^Yaa^a\a\D\aaaaa`V`a``abbaabaabaaa    0    "
+24:A>N,34:T>N"
PGX_SOLEXA1:4:9:5:314#0/1    -    chr1    35239    AAAAAAGGGAGGTAAAAAA
+AAAAAAAAAAAAAAAAA    BB^Yaa^a\a\D\aaaaa`V`a``abbaabaabaaa    0    "24
+:A>N,34:T>N"
PGX_SOLEXA1:4:9:21:314#0/1    +    chr4    1110    GAAAAAGGGAGGTAAAAAA
+AAAAAAAAAAAAAAAAG    BB^Yaa^a\a\D\aaaaa`V`a``abbaabaabaaa    0    "24
+:A>N,34:T>N"
[download]

perl script

!/usr/bin/perl

# filename: Read_Bow_SNP.pl                                           
+  
## Script description: This program takes input and match the chromoso
+me location from Bowtie parser file,
#                     then adds the read # to the window (also includi
+ng neighboring windows)


## Perl interpreter command
# use warnings;
use strict;
use warnings;

## Hashes initialized
my %input;
my %bowtieP;

my $location = 0;

## Open files


# file1 - input: create window # from position
open (FILE1, "Mock3_SNP.txt");
my $head = <FILE1>; # If there is a header in the input #1, include th
+is code
while (<FILE1>) {
    chomp;
    my $input_orig = $_;
    my @line = split /\s+/, $_;
    my $name = $line[0];
    my $chr = "chr$line[1]";
    my $pos = $line[2];
    $input{$name}[0] = $name;
    $input{$name}[1] = $chr;
    $input{$name}[2] = $pos;
}
close FILE1;

open (OUT, "> test.txt"); ### Change if file name changes
print OUT "ID\tInput chr\tInput pos\tBow ID\tBow strand\tBow chr\tBow 
+pos\tBow pos+35\tSeq\tSNP\n";

open (FILE2, "Mock3_forSNP.bow"); #bow file
while (<FILE2>) {
    chomp;
    my @line = split /\s+/, $_;
    if ($line[6]==0) {
    my $ID = $line[0];
    my $strand = $line[1];
    my $chr = $line[2];
    my $pos = $line[3];
    my $seq = $line[4];
        
    foreach my $name(keys %input) {
        if ($input{$name}[1] eq $chr) {
            if ($input{$name}[2] > $pos-1 && $input{$name}[2] < $pos+3
+6) {
                #print OUT "$input{$name}[0]\t$input{$name}[1]\t$input
+{$name}[2]\t$ID\t$strand\t$chr\t$pos\t$pos+35\t$seq\n";
                print OUT "$input{$name}[0]\t$input{$name}[1]\t$input{
+$name}[2]\t$ID\t$strand\t$chr\t$pos\t($pos+35)\t$seq\t";
                if ($input{$name}[2] == $pos) {
                    my $fSNP = substr($seq, 0, 1);
                    print OUT "$fSNP\n";    
                }
                if ($input{$name}[2] == $pos+35) {
                    my $fSNP = substr($seq, 35,1);
                    print OUT "$fSNP\n";    
                }
                if ($input{$name}[2] != $pos+35 && $input{$name}[2] !=
+ $pos) {
                    my $SNP = substr($seq, 0, $input{$name}[2]-($pos-1
+));
                    my $fSNP = substr($SNP, $input{$name}[2]-$pos);   
+     
                    print OUT "$fSNP\n";    
                }    
                }
            }        
        }
    }    
}
close FILE2;
close OUT;
[download]

OMG IT'S PERL!!

~(o.o~) (~o.o)~

Perl and a blind date both require regular expression.. -_-

Comment on How can I improve my inefficient codes? Select or Download Code

Replies are listed 'Best First'.
Re: How can I improve my inefficient codes? by moritz (Cardinal) on Nov 10, 2010 at 21:24 UTC
`foreach my $name(keys %input) { if ($input{$name}[1] eq $chr) {` [download] That's a source of needless inefficiency. If you use `$chr` as the hash key (and not `$name` as you do now), you can substitute the loop by a simple hash lookup. If there can be multiple lines with the same $chr, you need to use a hash-of-arrays-of-arrays, and iterate over the outer array. Perl 6 - second systems done right	[reply] [d/l] [select]
Re: How can I improve my inefficient codes? by Tanktalus (Canon) on Nov 10, 2010 at 23:49 UTC
Some of my comments will be based on code improvements with no speed gain, others may address your speed concern. `open (FILE1, "Mock3_SNP.txt");` Use the three-arg form of open, use a lexical, check errors: `open my $input, '<', 'Mock3_SNP.txt' or die "Can't read Mock3_SNP.txt: $!";` - even better is to have the name of the input in a variable that could be passed in on the commandline in the future, but I digress. `while (<FILE1>) { chomp; my $input_orig = $_;` [download] You're reading the file, then copying it to a new variable? And then doing nothing with the new variable? Using a variable other than $_ is good, but no point in duplicating everything around: `while (defined (my $input_line = <$input>)) { chomp $input_line;` [download] `my @line = split /\s+/, $_;` Generally speaking, if you're splitting on spaces, you really should use ' ' instead of /\s+/. It Does The Right Thing more often. It is not, however, exactly the same. See the documentation on split for details. (IIRC, this also means you can get rid of the chomp as the split will implicitly do so.) `my @line = split ' ', $input_line;` [download] `my $chr = "chr$line[1]";` That's an interesting trick. However, if chr values are never very large, i.e., over say 10,000, we could take moritz' suggestion and turn it on its head. Instead of putting this in a hash, put it in an array based on the chr number. Instead of `my %input;`, use `my @input;`. This sparse array will take up a fair bit of space, but should be faster for lookups than a hash. However, if the size gets too big then we get the problem of running out of RAM, starting a swap, and that will definitely be slow. However, if I read your node properly, the numbers will be 1-36, so definitely no problem whatsoever. `$input{$name}[0] = $name; $input{$name}[1] = $chr; $input{$name}[2] = $pos;` [download] This can be done much simpler: `$input{$name} = [ $name, $chr, $pos ];` [download] (though why you need to duplicate $name, I don't know.) Of course, I just suggested turning it on its head and moritz pointed out collisions, so we get: `push @{$input[$chr]} => [ $name, $pos ];` [download] Mind you, I generally prefer named values, so I'd go with: `push @{$input[$chr]} => { name => $name, pos => $pos };` [download] `open (OUT, "> test.txt"); ### Change if file name changes` [download] Again, three-arg, lexical filehandle, check if open failed. However, the comment here really drives home the point I made with the previous open: put it in a variable at the top, which will make it easier for someone to change, but also make it easier to convert into a command line parameter later. `### At the top: my $output_file = 'test.txt'; ### Change if the file name changes ### the open is then: open my $out, '>', $output_file or die "Can't open $output_file: $!";` [download] A bunch of repeat comments since you're handling things about the same as before. And then to the part moritz commented on: `foreach my $name(keys %input) { if ($input{$name}[1] eq $chr) { if ($input{$name}[2] > $pos-1 && $input{$name}[2] < $pos+3 +6) {` [download] This is now just a simple lookup. First, we have to convert $chr to just the number, and then see if there is one or more input entry that matches. `$chr =~ s/^chr//; # remove chr if it's there ### or, if chr must be there: # substr $chr, 0, 3, ''; # removes first three characters if (defined $input[$chr]) { for my $i (@{$input[$chr]}) { if ($pos <= $i->{pos} && $i->{pos} <= $pos+35) {` [download] If you use the array of arrays instead of the array of hashes that I prefer, that'd be `$i->[1]` instead of `$i->{pos}`. (I think the AoA will be faster, but I'd suggest benchmarking it to see if it's significant. Readability is still nice.) `if ($input{$name}[2] == $pos) { my $fSNP = substr($seq, 0, 1); print OUT "$fSNP\n"; } if ($input{$name}[2] == $pos+35) { my $fSNP = substr($seq, 35,1); print OUT "$fSNP\n"; } if ($input{$name}[2] != $pos+35 && $input{$name}[2] != + $pos) { my $SNP = substr($seq, 0, $input{$name}[2]-($pos-1 +)); my $fSNP = substr($SNP, $input{$name}[2]-$pos); + print OUT "$fSNP\n"; }` [download] A few points. First, this should be if/elsif/else since all the if blocks are mutually exclusive. That would cut down on the comparisons. Second, if you're using perl 5.10+, using given/when might be better since it's all the same item you're checking against, `$input{$name}[2]` (or now `$i->{pos}` or `$i->[1]`). Third, it looks like the last one handles all the cases, but somewhat poorly. `my $fSNP = substr($seq, $i->{pos} - $pos, 1); print $out "$fSNP\n";` [download] Untested, and I may be misreading this, but if I have read it properly, this should speed things up since you only need to copy one character once rather than a bunch of characters and then one character. Hope that helps	[reply] [d/l] [select]