statsman5 has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, Im a statistician doing some work in the field of Biostatistics and need to extract some information to help with my research. Im pretty new to Perl, but have tried writing my own code to extract DNA base sequences from FASTA formatted files. I have text files and fasta files saved to my computer of each different "species" that I need. And I have this code so far....
#!/usr/bin/perl ($root,$gene) = (@ARGV); open(TXT,$root.".txt") || die "Cannot open $root.txt"; open(FASTA, $root.".fasta") || die "Cannot open $root.fasta"; $found=0; while (<TXT>) { ($start,$stop) = m/^$gene \((\d+)..(\d+)\)/io || next; $found=1; last; } die "Did not find gene $gene in $root.txt" unless $found=1; $found = 0; while (<FASTA>) { chop; @x=split; if ($x[0] >= $start) { # start-logic here; } } while (<FASTA>) { chop; @x=split; if ($x[0] <= $stop) { # stop-logic here } } # print logic here
If I go to my command-line prompt and type: perl DNA_sequences.pl NC_001666 rps12, my program will (or should) go into NC_001666.txt and figure out the range of bases that the gene rps12 spans.
rps12 -(92301..93101) rps7 -(91772..92242) ndhB -(89236..91472) trnL -(88584..88664) trnI -(84881..84954) rpl23 -(84433..84714) rpl2 -(82930..84414)
If it is, for example, from base 91772 to 92242, then the program will go into the fasta file and pull out all bases from the 91772nd to the 92242nd, inclusively. If in the text file, there is a negative sign in front of the range (like in this one), then the order of the bases printed out should be reversed, and all a's should be converted to t's, and all c's to g's and vice versa. Im sure a lot of people here already know this, but every fasta file has a header that is the entire first line, and then the bases start on the next line. If you would like to see one, here's the fasta file link for species NC_001666...
http://www.ncbi.nlm.nih.gov/nuccore/11994090?report=fasta&log$=seqview
Everything after the $found=0 before the while(<FASTA>) is just me guessing on how to start that part, and I didn't know where else to go with it. The top part of the code with getting the start and stop range from the text file works pretty well, I think. I also just need to figure out how to reverse the output like I said when there's a negative sign. Thanks for taking the time to look at this...I got it started, but then was at a loss for how to get the bases out of the fasta file and only print out the certain range of them I need depending on the given species and gene. -statsman5

Replies are listed 'Best First'.
Re: Extracting DNA sequences from FASTA files
by Anonymous Monk on Jun 29, 2009 at 08:14 UTC

      I agree with the above, there are plenty of modules to help with parsing FASTA, the best documanted and supported are on BioPerl though:
      http://www.bioperl.org/wiki/FASTA_sequence_format

      Or you can set the line delimiter to the FASTA record separator (>) :
      $/ = '>';
      and slurp in the FASTA one record at a time.

      As far as reverse complmenting goes, i am sure there is support out there, but it is very simple to do yourself:

      my $seq = 'AGCTGATCGTAATAGAGCTA'; my $rev = rev_comp($seq,); print "The reverse complement of \'$seq\' is \'$rev\'\n"; sub rev_comp{ my $in = shift; ## reverse it my $rev_comp = reverse($in); ## complment it using tr/// my $count = $rev_comp =~ tr/AGCT/TCGA/i; ## or tr/AGCTN/TCGAN/i if y +ou have N's ## check we changed all the bases, or did something weird occur... if ($count = length($rev_comp){ return $rev_comp; } else { ## not all bases were changed, ## so something weird is happenning ## maybe you still have whitespace, or N's etc... return an error; } }

      Obviously this is a lengthy way of doing it but i am trying to keep things clear! Hope it helps.

      Just a something something...
Re: Extracting DNA sequences from FASTA files
by Anonymous Monk on Jun 29, 2009 at 10:58 UTC
    to reverse the bases:
    my %ibase; $ibase{"A"} = "T"; $ibase{"T"} = "A"; $ibase{"C"} = "G"; $ibase{"G"} = "C"; $ibase{"N"} = "N"; sub seq_inverse { my ($nseq) = @_; my @seq = split(m//,$nseq); my $lim = @seq; my $inv = ""; $inv .= $ibase{$seq[$lim]} while($lim > 0 and $lim--); return $inv; }
      It's easier when you work *with* Perl rather than against it:
      sub reverse_complement { local $_ = shift; y/ACGT/TGCA/; reverse; }
      As for the OP's question, FASTA is a pretty simple format other than the various stuff people put in the sequence labels. So your code will probably look something like this:
      while (<FASTA>) { chomp; if (/^>\s*(.*)/) { # it's a sequence label $name = $1; } else { # it's sequence data } }