FluffyBunny has asked for the wisdom of the Perl Monks concerning the following question:

Hello perlmonks,

I wrote a basic code that can handle some text manipulation. Basically my first input (Mock3_SNP.txt) contains the exact location of the text, and I am trying to find what the text is from Mock3_forSNP.bow. the latter bowfile contains start position and 36 letters (if you understand some basic Biology, it will be chromosome number and starting position and it gives 36 bases -left to right). I succeeded with my small mock files to do the work, but with 20million lines of bowfile wouldnt do this in a few hour. My theoretical running time is actually weeks or months!! Could you guys take a look at my code and help me improve it? Thank you so much for reading!

Mock3_SNP.txt (first input)

Name Chr Position rs1 1 35251 rs2 1 35248 rs3 1 37514 rs4 1 21145 rs5 2 35150 rs6 3 11412 rs7 4 7721 rs8 21 42516 rs9 4 1110 rs10 4 1145

Mock3_forSNP.bow - second file

PGX_SOLEXA1:2:1:5:804#0/1 - chr1 35245 GGGAGGTAAAAAAAAAAAA +AAAAAAAAAAAAAAAAA BB^Yaa^a\a\D\aaaaa`V`a``abbaabaabaaa 0 "24 +:A>N,34:T>N" PGX_SOLEXA1:2:4:5:804#0/1 + chr1 37510 TTTTCTTTTTTTTTTTTTT +TTTTTTTTTTTTTTTTT BB^Yaa^a\a\D\aaaaa`V`a``abbaabaabaaa 0 "24 +:A>N,34:T>N" PGX_SOLEXA1:4:4:5:804#0/1 - chr1 21140 CCCCCGCCCCCCCCCCCCC +CCCCCCCCCCCCCCCCC BB^Yaa^a\a\D\aaaaa`V`a``abbaabaabaaa 0 "24 +:A>N,34:T>N" PGX_SOLEXA1:4:4:5:774#0/1 - chr2 35128 AAAAAAAAAAAAAAAAAAA +AAATAAAAAAAAAAAAA BB^Yaa^a\a\D\aaaaa`V`a``abbaabaabaaa 0 "24 +:A>N,34:T>N" PGX_SOLEXA1:4:8:5:474#0/1 + chr3 11410 CCGCCCCCCCCCCCCCCCC +CCCCCCCCCCCCCCCCC BB^Yaa^a\a\D\aaaaa`V`a``abbaabaabaaa 21 "2 +4:A>N,34:T>N" PGX_SOLEXA1:4:8:5:314#0/1 + chr5 7720 AGAAAAAAAAAAAAAAAAAA +AAAAAAAAAAAAAAAA BB^Yaa^a\a\D\aaaaa`V`a``abbaabaabaaa 0 "24: +A>N,34:T>N" PGX_SOLEXA1:15:8:5:314#0/1 - chr21 42510 GGGGGGTCCCCCCCCCC +CCCCCCCCCCCCCCCCCCC BB^Yaa^a\a\D\aaaaa`V`a``abbaabaabaaa 0 " +24:A>N,34:T>N" PGX_SOLEXA1:4:9:5:314#0/1 - chr1 35239 AAAAAAGGGAGGTAAAAAA +AAAAAAAAAAAAAAAAA BB^Yaa^a\a\D\aaaaa`V`a``abbaabaabaaa 0 "24 +:A>N,34:T>N" PGX_SOLEXA1:4:9:21:314#0/1 + chr4 1110 GAAAAAGGGAGGTAAAAAA +AAAAAAAAAAAAAAAAG BB^Yaa^a\a\D\aaaaa`V`a``abbaabaabaaa 0 "24 +:A>N,34:T>N"

perl script

!/usr/bin/perl # filename: Read_Bow_SNP.pl + ## Script description: This program takes input and match the chromoso +me location from Bowtie parser file, # then adds the read # to the window (also includi +ng neighboring windows) ## Perl interpreter command # use warnings; use strict; use warnings; ## Hashes initialized my %input; my %bowtieP; my $location = 0; ## Open files # file1 - input: create window # from position open (FILE1, "Mock3_SNP.txt"); my $head = <FILE1>; # If there is a header in the input #1, include th +is code while (<FILE1>) { chomp; my $input_orig = $_; my @line = split /\s+/, $_; my $name = $line[0]; my $chr = "chr$line[1]"; my $pos = $line[2]; $input{$name}[0] = $name; $input{$name}[1] = $chr; $input{$name}[2] = $pos; } close FILE1; open (OUT, "> test.txt"); ### Change if file name changes print OUT "ID\tInput chr\tInput pos\tBow ID\tBow strand\tBow chr\tBow +pos\tBow pos+35\tSeq\tSNP\n"; open (FILE2, "Mock3_forSNP.bow"); #bow file while (<FILE2>) { chomp; my @line = split /\s+/, $_; if ($line[6]==0) { my $ID = $line[0]; my $strand = $line[1]; my $chr = $line[2]; my $pos = $line[3]; my $seq = $line[4]; foreach my $name(keys %input) { if ($input{$name}[1] eq $chr) { if ($input{$name}[2] > $pos-1 && $input{$name}[2] < $pos+3 +6) { #print OUT "$input{$name}[0]\t$input{$name}[1]\t$input +{$name}[2]\t$ID\t$strand\t$chr\t$pos\t$pos+35\t$seq\n"; print OUT "$input{$name}[0]\t$input{$name}[1]\t$input{ +$name}[2]\t$ID\t$strand\t$chr\t$pos\t($pos+35)\t$seq\t"; if ($input{$name}[2] == $pos) { my $fSNP = substr($seq, 0, 1); print OUT "$fSNP\n"; } if ($input{$name}[2] == $pos+35) { my $fSNP = substr($seq, 35,1); print OUT "$fSNP\n"; } if ($input{$name}[2] != $pos+35 && $input{$name}[2] != + $pos) { my $SNP = substr($seq, 0, $input{$name}[2]-($pos-1 +)); my $fSNP = substr($SNP, $input{$name}[2]-$pos); + print OUT "$fSNP\n"; } } } } } } close FILE2; close OUT;

OMG IT'S PERL!!

~(o.o~) (~o.o)~

Perl and a blind date both require regular expression.. -_-

Replies are listed 'Best First'.
Re: How can I improve my inefficient codes?
by moritz (Cardinal) on Nov 10, 2010 at 21:24 UTC
    foreach my $name(keys %input) { if ($input{$name}[1] eq $chr) {

    That's a source of needless inefficiency. If you use $chr as the hash key (and not $name as you do now), you can substitute the loop by a simple hash lookup. If there can be multiple lines with the same $chr, you need to use a hash-of-arrays-of-arrays, and iterate over the outer array.

Re: How can I improve my inefficient codes?
by Tanktalus (Canon) on Nov 10, 2010 at 23:49 UTC

    Some of my comments will be based on code improvements with no speed gain, others may address your speed concern.

    1. open (FILE1, "Mock3_SNP.txt"); Use the three-arg form of open, use a lexical, check errors: open my $input, '<', 'Mock3_SNP.txt' or die "Can't read Mock3_SNP.txt: $!"; - even better is to have the name of the input in a variable that could be passed in on the commandline in the future, but I digress.
    2. while (<FILE1>) { chomp; my $input_orig = $_;
      You're reading the file, then copying it to a new variable? And then doing nothing with the new variable? Using a variable other than $_ is good, but no point in duplicating everything around:
      while (defined (my $input_line = <$input>)) { chomp $input_line;
    3.     my @line = split /\s+/, $_; Generally speaking, if you're splitting on spaces, you really should use ' ' instead of /\s+/. It Does The Right Thing more often. It is not, however, exactly the same. See the documentation on split for details. (IIRC, this also means you can get rid of the chomp as the split will implicitly do so.)
      my @line = split ' ', $input_line;
    4. my $chr = "chr$line[1]"; That's an interesting trick. However, if chr values are never very large, i.e., over say 10,000, we could take moritz' suggestion and turn it on its head. Instead of putting this in a hash, put it in an array based on the chr number. Instead of my %input;, use my @input;. This sparse array will take up a fair bit of space, but should be faster for lookups than a hash. However, if the size gets too big then we get the problem of running out of RAM, starting a swap, and that will definitely be slow. However, if I read your node properly, the numbers will be 1-36, so definitely no problem whatsoever.
    5. $input{$name}[0] = $name; $input{$name}[1] = $chr; $input{$name}[2] = $pos;
      This can be done much simpler:
      $input{$name} = [ $name, $chr, $pos ];
      (though why you need to duplicate $name, I don't know.) Of course, I just suggested turning it on its head and moritz pointed out collisions, so we get:
      push @{$input[$chr]} => [ $name, $pos ];
      Mind you, I generally prefer named values, so I'd go with:
      push @{$input[$chr]} => { name => $name, pos => $pos };
    6. open (OUT, "> test.txt"); ### Change if file name changes
      Again, three-arg, lexical filehandle, check if open failed. However, the comment here really drives home the point I made with the previous open: put it in a variable at the top, which will make it easier for someone to change, but also make it easier to convert into a command line parameter later.
      ### At the top: my $output_file = 'test.txt'; ### Change if the file name changes ### the open is then: open my $out, '>', $output_file or die "Can't open $output_file: $!";
    7. A bunch of repeat comments since you're handling things about the same as before. And then to the part moritz commented on:
      foreach my $name(keys %input) { if ($input{$name}[1] eq $chr) { if ($input{$name}[2] > $pos-1 && $input{$name}[2] < $pos+3 +6) {
      This is now just a simple lookup. First, we have to convert $chr to just the number, and then see if there is one or more input entry that matches.
      $chr =~ s/^chr//; # remove chr if it's there ### or, if chr must be there: # substr $chr, 0, 3, ''; # removes first three characters if (defined $input[$chr]) { for my $i (@{$input[$chr]}) { if ($pos <= $i->{pos} && $i->{pos} <= $pos+35) {
      If you use the array of arrays instead of the array of hashes that I prefer, that'd be $i->[1] instead of $i->{pos}. (I think the AoA will be faster, but I'd suggest benchmarking it to see if it's significant. Readability is still nice.)
    8. if ($input{$name}[2] == $pos) { my $fSNP = substr($seq, 0, 1); print OUT "$fSNP\n"; } if ($input{$name}[2] == $pos+35) { my $fSNP = substr($seq, 35,1); print OUT "$fSNP\n"; } if ($input{$name}[2] != $pos+35 && $input{$name}[2] != + $pos) { my $SNP = substr($seq, 0, $input{$name}[2]-($pos-1 +)); my $fSNP = substr($SNP, $input{$name}[2]-$pos); + print OUT "$fSNP\n"; }
      A few points. First, this should be if/elsif/else since all the if blocks are mutually exclusive. That would cut down on the comparisons. Second, if you're using perl 5.10+, using given/when might be better since it's all the same item you're checking against, $input{$name}[2] (or now $i->{pos} or $i->[1]). Third, it looks like the last one handles all the cases, but somewhat poorly.
      my $fSNP = substr($seq, $i->{pos} - $pos, 1); print $out "$fSNP\n";
      Untested, and I may be misreading this, but if I have read it properly, this should speed things up since you only need to copy one character once rather than a bunch of characters and then one character.
    Hope that helps