wadunn has asked for the wisdom of the Perl Monks concerning the following question:
I am still a pretty green Perl user, but have a few functional programs under my belt. I am using Perl to crunch DNA sequences for my lab. We have been manually looking for a group of genes in the African malaria mosquito and want to send off the coordinates that we discovered to the powers that be. I thought that I would write a program that took a giant sequence of DNA (called a contig) and extracted the genes from it based on the coordinates that we found in order to double check each set of locations before we sent them off.
For those non biologists, the search file is basically a large text file and I want my program to go so many letters from the start, take a given number of letters after that and place them in a variable (think of this as a word). Then go further downstream and pick another word, and then one more. Then I concatenate the words into a sentence (the final gene). I was not anticipating a difficult program.
I thought I would use ‘substr’ because its seems tailor made for this. When I tried it on a test file (660 letters) using three calls to substr and placing the three ‘words’ into separate variables, everything behaved beautifully.
Problems arose when I went to the actual search file (2,082,241 letters). With the exact same script my extracted substrings were being pulled from the sequence AHEAD of the target sequence as confirmed by my manual location of the coordinates that the program was given. Am I missing something about the behavior of substr that it would act differently on a VERY large string?
I know that gigantic strings are memory blackholes but at my current level of expertise I don’t know how to avoid loading the whole file. I don’t care how slow it runs, I can run the thing overnight (eventually will be unleashed on about 150 genes); it just needs to work. I know that your job is not to look over idiot programmers’ code but I included mine incase it helps you see what I am trying to do or spot any mistake I may have over looked.
Thank you so much and I will be praying to the perl gods until your comments come.
Augustine
This is my test data that works:#!/usr/bin/perl -w use warnings; use strict; use diagnostics; $/ = "//"; my $input = 'CUTR;AAABtest;plus;185-190;438-440;576-579'; #two sets of "instructions" I can interchange in $input for testing #'CUTR25;AAAB01008851;plus;764935-764946;765050-765289;765372-76 +5659' #'CUTR;AAABtest;plus;185-190;438-440;576-579' my @input = split (/;/, $input); my $name = $input[0]; my $contig = $input[1]; my $ori = $input[2]; my @E1 = split (/-/, $input[3]); my @E2 = split (/-/, $input[4]); my @E3 = split (/-/, $input[5]); print "$name\n$contig\n$ori\n@E1\n@E2\n@E3\n\n"; #get and edit contig so it can be searched and used my $contigfile = "$contig.txt"; open(CONTIG, $contigfile) or die (">Could not open $contigfile!\n$!" +); my $rawseq = <CONTIG>; my $goodseq = editcontig($rawseq); #clear giant variable to free some memory???? $rawseq = ''; #Construct exon 1 and save to $EXON1 my $offset1 = $E1[0]-1; my $length1 = $E1[1]-$offset1; my $EXON1 = ''; $EXON1 = substr($goodseq, $offset1, $length1); #Construct exon 2 my $offset2 = $E2[0]-1; my $length2 = $E2[1]-$offset2; my $EXON2 = ''; $EXON2 = substr($goodseq, $offset2, $length2); #Construct exon3 my $offset3 = $E3[0]-1; my $length3 = $E3[1]-$offset3; my $EXON3 = ''; $EXON3 = substr($goodseq, $offset3, $length3); print "$EXON1\n\n$EXON2\n\n$EXON3"; exit; ########### SUBS ############ sub editcontig { my ($rawseq) = @_; $rawseq =~ s/\s//g; $rawseq =~ s/0//g; $rawseq =~ s/1//g; $rawseq =~ s/2//g; $rawseq =~ s/3//g; $rawseq =~ s/4//g; $rawseq =~ s/5//g; $rawseq =~ s/6//g; $rawseq =~ s/7//g; $rawseq =~ s/8//g; $rawseq =~ s/9//g; return $rawseq; }
01 aagaagaagf agaagaagag agagaacttg gaaacagaat tgtagaacag atttatagac 61 agagcaaagt acaattccgt ttcagacaag taacaagagt gaaatagata taagactcag 121 agaaagtaaa aagcgtcaag taaggtacag tgaggagaaa gtaatgagtg tgtatgttca 181 atcgBROOKS tacccttgcg tatctcctcc ttcctcagcc accaacagtt catactcctg 241 cgcagcgtat cgatcccagt cagtaagcta aaaggtcaaa cacatgttat tagatggcat 301 ttcgatagaa tgtcatcgtg ctattcttac atcgtcgttt tcgcgctgtg caaatggatc 361 acgctgctcg tgctccatat acttctccag attgaagaac gtgtcgaaga aaatgggcgt 421 cattttgcat cgctttaWAS ggaccaacgt tacacagcca ggagttgctg gttttatcat 481 atcaagcatc tggaagcaaa cgagtactcc ttagcatatc gaaccaactc catctaccac 541 tccccaaaca aacacatact tgacacaggc agtctHEREa gggtagggtt tcgataccga 601 gcgattccat gcgatgctgc tgctcctcgt agaagtactc cagctcgtac atcgagagga
|
|---|