comment on

Hello perlmonks,

I wrote a basic code that can handle some text manipulation. Basically my first input (Mock3_SNP.txt) contains the exact location of the text, and I am trying to find what the text is from Mock3_forSNP.bow. the latter bowfile contains start position and 36 letters (if you understand some basic Biology, it will be chromosome number and starting position and it gives 36 bases -left to right). I succeeded with my small mock files to do the work, but with 20million lines of bowfile wouldnt do this in a few hour. My theoretical running time is actually weeks or months!! Could you guys take a look at my code and help me improve it? Thank you so much for reading!

Mock3_SNP.txt (first input)

Name    Chr    Position
rs1    1    35251
rs2    1    35248
rs3    1    37514
rs4    1    21145
rs5    2    35150
rs6    3    11412
rs7    4    7721
rs8    21    42516
rs9    4    1110
rs10    4    1145
[download]

Mock3_forSNP.bow - second file

PGX_SOLEXA1:2:1:5:804#0/1    -    chr1    35245    GGGAGGTAAAAAAAAAAAA
+AAAAAAAAAAAAAAAAA    BB^Yaa^a\a\D\aaaaa`V`a``abbaabaabaaa    0    "24
+:A>N,34:T>N"
PGX_SOLEXA1:2:4:5:804#0/1    +    chr1    37510    TTTTCTTTTTTTTTTTTTT
+TTTTTTTTTTTTTTTTT    BB^Yaa^a\a\D\aaaaa`V`a``abbaabaabaaa    0    "24
+:A>N,34:T>N"
PGX_SOLEXA1:4:4:5:804#0/1    -    chr1    21140    CCCCCGCCCCCCCCCCCCC
+CCCCCCCCCCCCCCCCC    BB^Yaa^a\a\D\aaaaa`V`a``abbaabaabaaa    0    "24
+:A>N,34:T>N"
PGX_SOLEXA1:4:4:5:774#0/1    -    chr2    35128    AAAAAAAAAAAAAAAAAAA
+AAATAAAAAAAAAAAAA    BB^Yaa^a\a\D\aaaaa`V`a``abbaabaabaaa    0    "24
+:A>N,34:T>N"
PGX_SOLEXA1:4:8:5:474#0/1    +    chr3    11410    CCGCCCCCCCCCCCCCCCC
+CCCCCCCCCCCCCCCCC    BB^Yaa^a\a\D\aaaaa`V`a``abbaabaabaaa    21    "2
+4:A>N,34:T>N"
PGX_SOLEXA1:4:8:5:314#0/1    +    chr5    7720    AGAAAAAAAAAAAAAAAAAA
+AAAAAAAAAAAAAAAA    BB^Yaa^a\a\D\aaaaa`V`a``abbaabaabaaa    0    "24:
+A>N,34:T>N"
PGX_SOLEXA1:15:8:5:314#0/1    -    chr21    42510    GGGGGGTCCCCCCCCCC
+CCCCCCCCCCCCCCCCCCC    BB^Yaa^a\a\D\aaaaa`V`a``abbaabaabaaa    0    "
+24:A>N,34:T>N"
PGX_SOLEXA1:4:9:5:314#0/1    -    chr1    35239    AAAAAAGGGAGGTAAAAAA
+AAAAAAAAAAAAAAAAA    BB^Yaa^a\a\D\aaaaa`V`a``abbaabaabaaa    0    "24
+:A>N,34:T>N"
PGX_SOLEXA1:4:9:21:314#0/1    +    chr4    1110    GAAAAAGGGAGGTAAAAAA
+AAAAAAAAAAAAAAAAG    BB^Yaa^a\a\D\aaaaa`V`a``abbaabaabaaa    0    "24
+:A>N,34:T>N"
[download]

perl script

!/usr/bin/perl

# filename: Read_Bow_SNP.pl                                           
+  
## Script description: This program takes input and match the chromoso
+me location from Bowtie parser file,
#                     then adds the read # to the window (also includi
+ng neighboring windows)


## Perl interpreter command
# use warnings;
use strict;
use warnings;

## Hashes initialized
my %input;
my %bowtieP;

my $location = 0;

## Open files


# file1 - input: create window # from position
open (FILE1, "Mock3_SNP.txt");
my $head = <FILE1>; # If there is a header in the input #1, include th
+is code
while (<FILE1>) {
    chomp;
    my $input_orig = $_;
    my @line = split /\s+/, $_;
    my $name = $line[0];
    my $chr = "chr$line[1]";
    my $pos = $line[2];
    $input{$name}[0] = $name;
    $input{$name}[1] = $chr;
    $input{$name}[2] = $pos;
}
close FILE1;

open (OUT, "> test.txt"); ### Change if file name changes
print OUT "ID\tInput chr\tInput pos\tBow ID\tBow strand\tBow chr\tBow 
+pos\tBow pos+35\tSeq\tSNP\n";

open (FILE2, "Mock3_forSNP.bow"); #bow file
while (<FILE2>) {
    chomp;
    my @line = split /\s+/, $_;
    if ($line[6]==0) {
    my $ID = $line[0];
    my $strand = $line[1];
    my $chr = $line[2];
    my $pos = $line[3];
    my $seq = $line[4];
        
    foreach my $name(keys %input) {
        if ($input{$name}[1] eq $chr) {
            if ($input{$name}[2] > $pos-1 && $input{$name}[2] < $pos+3
+6) {
                #print OUT "$input{$name}[0]\t$input{$name}[1]\t$input
+{$name}[2]\t$ID\t$strand\t$chr\t$pos\t$pos+35\t$seq\n";
                print OUT "$input{$name}[0]\t$input{$name}[1]\t$input{
+$name}[2]\t$ID\t$strand\t$chr\t$pos\t($pos+35)\t$seq\t";
                if ($input{$name}[2] == $pos) {
                    my $fSNP = substr($seq, 0, 1);
                    print OUT "$fSNP\n";    
                }
                if ($input{$name}[2] == $pos+35) {
                    my $fSNP = substr($seq, 35,1);
                    print OUT "$fSNP\n";    
                }
                if ($input{$name}[2] != $pos+35 && $input{$name}[2] !=
+ $pos) {
                    my $SNP = substr($seq, 0, $input{$name}[2]-($pos-1
+));
                    my $fSNP = substr($SNP, $input{$name}[2]-$pos);   
+     
                    print OUT "$fSNP\n";    
                }    
                }
            }        
        }
    }    
}
close FILE2;
close OUT;
[download]

OMG IT'S PERL!!

~(o.o~) (~o.o)~

Perl and a blind date both require regular expression.. -_-

In reply to How can I improve my inefficient codes? by FluffyBunny

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.