Gemchal has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I am a PhD student who is using Perl to analyse some genome data. I am trying to write some code to extract a piece of sequence (given start and end co-ordinates) from a FASTA file of one genome. The FASTA file format is:

    >genome1
    ACTGTTACTTGTACCTCAGGGTTTTCTCTTTTTTTACGCGCTCAGTCAGTCCCATG
    GTGCTGCCTGCATGCGTCAGTCA
    etc.

I also have a tab-delimited text file with 3 columns (gene number, gene start and gene finish), e.g.:

    1    0      825
    2    837    1000
    etc.

I would like the Perl script to output the sequence for each gene to a different file, with the gene number as the file name. Please help, even if it's just to suggest where to start; I am feeling a bit lost at the moment. I presume I need to make an array of the gene positions (start and finish), but I'm not sure where to go from there.

Thanks,
Gemma

Replies are listed 'Best First'.
Re: extract sequence given positions from fasta
by umasuresh (Hermit) on Nov 29, 2010 at 15:53 UTC
Re: extract sequence given positions from fasta
by tospo (Hermit) on Nov 29, 2010 at 16:00 UTC

    Yes, you will want to read your sequence ID/positional data into an array, ideally an array of hashes where the hash keys would be something like "id", "start", "end".
    Then you should probably use Bio::DB::Fasta to retrieve fragments of sequences from your FASTA file, but you could also use Bio::SeqIO and extract regions from your sequences using "substr" on the DNA string.
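As a rough illustration of the array-of-hashes plus "substr" route, here is a minimal sketch in core Perl only (no BioPerl). It rests on several assumptions that are not stated in the question: the FASTA file holds a single record, the start co-ordinate is 0-based, and the end co-ordinate is inclusive. The sub names (read_fasta_sequence, read_gene_table, extract_region) and the ".fa" extension on the output files are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read a single-record FASTA file and return the sequence as one string.
sub read_fasta_sequence {
    my ($fh) = @_;
    my $seq = '';
    while (my $line = <$fh>) {
        next if $line =~ /^>/;    # skip the header line
        $line =~ s/\s+//g;        # strip newlines and stray whitespace
        $seq .= $line;
    }
    return $seq;
}

# Read the tab-delimited gene table into an array of hashes
# with the keys "id", "start" and "end".
sub read_gene_table {
    my ($fh) = @_;
    my @genes;
    while (my $line = <$fh>) {
        $line =~ s/[\r\n]+$//;        # drop Unix or Windows line endings
        next unless $line =~ /\S/;    # skip blank lines
        my ($id, $start, $end) = split /\t/, $line;
        push @genes, { id => $id, start => $start, end => $end };
    }
    return @genes;
}

# ASSUMPTION: $start is 0-based and $end is inclusive.
# If your co-ordinates follow another convention, adjust the arithmetic here.
sub extract_region {
    my ($seq, $start, $end) = @_;
    return substr($seq, $start, $end - $start + 1);
}

# Only runs when called as: perl script.pl genome.fasta genes.txt
if (@ARGV == 2) {
    my ($fasta_file, $table_file) = @ARGV;

    open my $fa, '<', $fasta_file or die "Cannot open $fasta_file: $!";
    my $genome = read_fasta_sequence($fa);
    close $fa;

    open my $tab, '<', $table_file or die "Cannot open $table_file: $!";
    my @genes = read_gene_table($tab);
    close $tab;

    for my $gene (@genes) {
        # Hypothetical naming scheme: gene number plus ".fa"
        my $out_file = "$gene->{id}.fa";
        open my $out, '>', $out_file or die "Cannot write $out_file: $!";
        print {$out} ">gene_$gene->{id}\n",
            extract_region($genome, $gene->{start}, $gene->{end}), "\n";
        close $out;
    }
}
```

This reads the whole genome into memory, which is usually fine for a single bacterial-sized genome. For large or multi-record FASTA files, an indexed reader such as Bio::DB::Fasta is likely the better fit; check which co-ordinate convention it expects before passing your positions to it.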