comment on

Hi all, Im a statistician doing some work in the field of Biostatistics and need to extract some information to help with my research. Im pretty new to Perl, but have tried writing my own code to extract DNA base sequences from FASTA formatted files. I have text files and fasta files saved to my computer of each different "species" that I need. And I have this code so far....


#!/usr/bin/perl

($root,$gene) = (@ARGV);
open(TXT,$root.".txt") || die "Cannot open $root.txt";
open(FASTA, $root.".fasta") || die "Cannot open $root.fasta";

$found=0;
while (<TXT>) { 
    ($start,$stop) = m/^$gene \((\d+)..(\d+)\)/io || next;
    $found=1; 
    last;
}
die "Did not find gene $gene in $root.txt" unless $found=1;

$found = 0;
while (<FASTA>) { 
  chop; 
  @x=split;
  if ($x[0] >= $start) { 
      # start-logic here;
  }
}
while (<FASTA>) { 
  chop; 
  @x=split;
  if ($x[0] <= $stop) { 
      # stop-logic here
  }
}
# print logic here
[download]

If I go to my command-line prompt and type: perl DNA_sequences.pl NC_001666 rps12, my program will (or should) go into NC_001666.txt and figure out the range of bases that the gene rps12 spans.

    rps12       -(92301..93101)
    rps7        -(91772..92242)
    ndhB        -(89236..91472)
    trnL        -(88584..88664)
    trnI        -(84881..84954)
    rpl23       -(84433..84714)
    rpl2        -(82930..84414)
[download]

If it is, for example, from base 91772 to 92242, then the program will go into the fasta file and pull out all bases from the 91772nd to the 92242nd, inclusively. If in the text file, there is a negative sign in front of the range (like in this one), then the order of the bases printed out should be reversed, and all a's should be converted to t's, and all c's to g's and vice versa. Im sure a lot of people here already know this, but every fasta file has a header that is the entire first line, and then the bases start on the next line. If you would like to see one, here's the fasta file link for species NC_001666...

http://www.ncbi.nlm.nih.gov/nuccore/11994090?report=fasta&log$=seqview
[download]

Everything after the $found=0 before the while(<FASTA>) is just me guessing on how to start that part, and I didn't know where else to go with it. The top part of the code with getting the start and stop range from the text file works pretty well, I think. I also just need to figure out how to reverse the output like I said when there's a negative sign. Thanks for taking the time to look at this...I got it started, but then was at a loss for how to get the bases out of the fasta file and only print out the certain range of them I need depending on the given species and gene. -statsman5

In reply to Extracting DNA sequences from FASTA files by statsman5

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.