I am not familiar enough with the sequence format presented here to know whether BioPerl modules such as Bio::SeqIO can handle it as is, so it seems a good idea to clean up the file first and extract only the needed data. Unfortunately, after doing just that, I got stuck. I feel I may need a data structure such as a hash of arrays, but I cannot work out exactly what I have to do to realize this. Also, I manually removed the part of your sample file that goes like:

# ----- prediction on sequence number 3 (length = 713, name = seq_03) -----
#
# Constraints/Hints:
# (none)
# Predicted genes for sequence number 3 on both strands
# start gene g4 ....
[same as above] ...... and so on ...

However, the following is my initial take at it. I hope a wiser monk than myself can bring it to its destination so we can all learn something new:

#!/usr/local/bin/perl

# Title: "extraction of sequences"
# The sample file was saved as bioinfo.txt
use strict;
use warnings;
use IO::File;

my $handle = IO::File->new;
$handle->open("<bioinfo.txt")  or die("$!");

my @input_array;
my @new_array;

@input_array = <$handle>;
# keep only the lines that contained a '#', with every '#' stripped out
@input_array = grep {s/#//g} @input_array;

# NOTE: '$i < $#input_array' stops one short of the last element;
# '$i <= $#input_array' would include it.
for (my $i=0; $i<$#input_array; $i++){
        chomp $input_array[$i];
        next if $input_array[$i] =~ /((none)|checked|constraints|predicted)/i;  # shedding extras
        next unless $input_array[$i];           # ignoring empty lines
        push @new_array, $input_array[$i];      # capturing the elements that I need
}


# NOTE: as above, '$i < $#new_array' leaves out the last element.
for (my $i=0; $i<$#new_array; $i++){
        print "$i-$new_array[$i]\n";            # preparing for further processing
}

Here is the output from the snippet above:

0- ----- prediction on sequence number 1 (length = 105, name = seq_01) --
1- start gene g1
2- coding sequence = [atgtcgtccctccccactctcatctttctccaccc
3- atcgctgcggtcctcgccgacccttttgtgccggaagtagggaccgg]
4- protein sequence = [MTASAFVLGTVAFLHNRLRRSRPRQASTAHR
5- GTETPLLRSDKENLTTVLDATILVHSLGQKTNLALGATSSSLDLQKTNLAL
6- VAALTPGIVFPLPSPFVATGLCLQKTNLALGATSSSLDL]
7- end gene g1
8- start gene g2
9- coding sequence = [atgccgtcctcgtcaaagcagctggcgatgcc
10- tcggcccctccttctgcaaaccgccctgccgcccgcctcggctcctccgaa
11- gccgagcagcctacgcaggggccgcagatgctcgcgggagggaatatcgg]
12- protein sequence =[MPLDSSSTPTSNPAPSHSSTAYLLFERLHIAEQ
13- CCPGQGIRHGKWSPGSSEAPT]
14- end gene g2
15- ----- prediction on sequence number 2 (length = 710, name = seq_02) -----
16- start gene g3
17- coding sequence = [agctgccctcctcggggccagccttctcttaactc
18- tttgagaccttcaatcctgaggcgtgagacgcagtctggaggagcagctc]
19- protein sequence = [LRRETQSGGAALCSLFDPPPTPTACAHANSP]
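
For the data-structure step mentioned earlier, here is a minimal sketch of one possible shape. It uses a hash of hashes keyed by gene name rather than a hash of arrays, and it assumes the cleaned lines look like the ones listed above (without the leading index that the print loop adds). The names parse_genes, %genes and the keys sequence, coding and protein are made up for illustration; they do not come from BioPerl or from the original file.

```perl
use strict;
use warnings;

# Hypothetical helper: turn cleaned lines into
#   gene name => { sequence => ..., coding => ..., protein => ... }
sub parse_genes {
    my @lines = @_;
    my (%genes, $seq_name, $gene, $field);

    for my $line (@lines) {
        if ($line =~ /prediction on sequence number \d+ \(length = \d+, name = (\w+)\)/) {
            $seq_name = $1;                       # e.g. seq_01
        }
        elsif ($line =~ /start gene (\w+)/) {
            $gene  = $1;                          # e.g. g1
            $field = undef;
            $genes{$gene} = { sequence => $seq_name, coding => '', protein => '' };
        }
        elsif ($line =~ /end gene/) {
            ($gene, $field) = (undef, undef);
        }
        elsif (defined $gene) {
            my $text = $line;
            # A "coding sequence = [" or "protein sequence = [" label opens a field
            if ($text =~ s/^\s*(coding|protein) sequence\s*=\s*//) {
                $field = $1;
            }
            next unless defined $field;
            my $closed = $text =~ /\]/;           # a ']' ends the current field
            $text =~ s/[\[\]\s]//g;               # drop brackets and whitespace
            $genes{$gene}{$field} .= $text;       # join the continuation lines
            $field = undef if $closed;
        }
    }
    return %genes;
}

# Usage with the first gene from the output above
my @cleaned = (
    ' ----- prediction on sequence number 1 (length = 105, name = seq_01) -----',
    ' start gene g1',
    ' coding sequence = [atgtcgtccctccccactctcatctttctccaccc',
    ' atcgctgcggtcctcgccgacccttttgtgccggaagtagggaccgg]',
    ' protein sequence = [MTASAFVLGTVAFLHNRLRRSRPRQASTAHR',
    ' GTETPLLRSDKENLTTVLDATILVHSLGQKTNLALGATSSSLDLQKTNLAL',
    ' VAALTPGIVFPLPSPFVATGLCLQKTNLALGATSSSLDL]',
    ' end gene g1',
);
my %genes = parse_genes(@cleaned);
for my $name (sort keys %genes) {
    print "$name ($genes{$name}{sequence})\n";
    print "  coding:  $genes{$name}{coding}\n";
    print "  protein: $genes{$name}{protein}\n";
}
```

Keying on the gene name keeps the multi-line coding and protein sequences together as single strings, which may be easier to hand on to later processing than the flat list of lines.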

Out of curiosity, I translated the coding sequences and also back-translated the protein sequences on ExPASy, but the results differ from what is in the sample file you provided and appear unrelated to it. Any clues?
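
To repeat that check locally, here is a rough sketch that translates a coding sequence with the standard genetic code, reading only the first forward frame. The sub name translate and the 'X' placeholder for unrecognised codons are my own choices, and a different codon table or reading frame would of course give different results.

```perl
use strict;
use warnings;

# Standard genetic code, one letter per codon, with codons enumerated
# in t, c, a, g order for each of the three positions (ttt, ttc, tta, ...).
my $amino_acids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my @bases = qw(t c a g);
my %codon_table;
my $k = 0;
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            $codon_table{"$b1$b2$b3"} = substr($amino_acids, $k++, 1);
        }
    }
}

# Translate frame 1 of a DNA string; unknown codons become 'X',
# and an incomplete trailing codon is ignored.
sub translate {
    my ($dna) = @_;
    $dna = lc $dna;
    $dna =~ s/\s+//g;
    my $protein = '';
    for (my $i = 0; $i + 3 <= length $dna; $i += 3) {
        $protein .= $codon_table{ substr($dna, $i, 3) } // 'X';
    }
    return $protein;
}

# First 21 bases of the g1 coding sequence shown above
print translate('atgtcgtccctccccactctc'), "\n";   # prints MSSLPTL
```

The g1 protein sequence in the sample output begins with MTASAFV, so this small check shows the same mismatch as the ExPASy result.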

Excellence is an Endeavor of Persistence. Chance Favors a Prepared Mind.

In reply to Re: extraction of sequences by biohisham
in thread extraction of sequences by patric
