Dear monks, I have a result file that looks like this:

GENSCAN 1.0    Date run: 11-Mar-109    Time: 00:29:37

Sequence scaffold_2 : 47583 bp : 37.42% C+G : Isochore 1 ( 0 - 100 C+G%)

Parameter matrix: test.smat

Predicted genes/exons:

Gn.Ex Type S .Begin ...End .Len Fr Ph I/Ac Do/T CodRg P.... Tscr..
----- ---- - ------ ------ ---- -- -- ---- ---- ----- ----- ------

 1.01 Init -   2604   2462  143  0  2   77   96   184 0.883  20.81
 1.00 Prom -   3677   3638   40                               2.82

 2.00 Prom +   4382   4421   40                               1.62
 2.01 Init +   6395   6399    5  1  2   73   84     0 0.122  -0.16
 2.02 Intr +   8270   8412  143  2  2   74   82   110 0.141  12.00
 2.03 Term +   8516   8523    8  0  2   46   46     0 0.473  -8.05
 2.04 PlyA +   9723   9728    6                               2.27

Predicted peptide sequence(s):

Predicted coding sequence(s):


>scaffold_2|GENSCAN_predicted_peptide_1|48_aa
MDRSHQIWDEKLMDQKCVLYPVCLWHEQESEMNWEAEKAERLRTEQKX

>scaffold_2|GENSCAN_predicted_CDS_1|144_bp
atggatcggagtcatcagatatgggatgagaagctgatggatcagaagtgcgtcctctac
cccgtctgcctctggcacgagcaggagagcgagatgaactgggaggctgagaaagctgag
aggctaaggactgaacagaaaggn

>scaffold_2|GENSCAN_predicted_peptide_2|51_aa
MRKNEGVDDADPLIEGMKEITLKDSATPEDKDQEDKQDERGEEIQEQPQEM

>scaffold_2|GENSCAN_predicted_CDS_2|156_bp
atgaggaagaatgagggtgttgatgatgcagatccactaatagaaggtatgaaggagatc
actctgaaagattcagcaactccagaagacaaggatcaagaagacaaacaagatgagaga
ggtgaagaaattcaagaacaacctcaagagatgtag

This is the content of one of the result files, and I have 40,000 files like this. From these files I need to copy only the sequence headers (the lines starting with >) and the sequences below them into two different files: one (seqfile1) for "GENSCAN_predicted_CDS" and one (seqfile2) for "GENSCAN_predicted_peptide". So all the lowercase sequences, with their headers, should go to seqfile1, and all the uppercase sequences, with their headers, should go to seqfile2. My program goes like this:

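#!/usr/bin/perl

# Output files: one for CDS records, one for peptide records.
open(OUT1, ">seqfile1_lc.out") or die "Cannot open seqfile1_lc.out: $!";
open(OUT2, ">seqfile2_uc.out") or die "Cannot open seqfile2_uc.out: $!";

# First pass: read every result file and collect headers and sequences in @seq.
opendir(DIR, "/home/user/Desktop/resultfolder") or die "Cannot open result folder: $!";
while (defined($filename = readdir(DIR))) {
    next if $filename =~ /^\.\.?$/;    # skip the . and .. directory entries
    open(FILE, "/home/user/Desktop/resultfolder/$filename") or die "Cannot open $filename: $!";
    while (<FILE>) {
        if ($_ =~ m/^>/ && ($_ =~ m/GENSCAN_predicted_CDS/ || $_ =~ m/GENSCAN_predicted_peptide/)) {
            # Header line: store the text collected so far, then the header itself.
            push(@seq, $seq) if $seq;
            $seq = "";
            push(@seq, "\n$_");
        } else {
            # Any other line: strip trailing whitespace and append it to the current text.
            $_ =~ s/\s*$//;
            $seq = $seq . $_;
        }
    }
    close(FILE);
}
closedir(DIR);
push(@seq, $seq);

# Second pass: each header in @seq is followed by its sequence; route the pair by type.
# Headers were stored with a leading newline, hence the \s* before the >.
while ($_ = shift @seq) {
    if ($_ =~ m/^\s*>/ && $_ =~ m/GENSCAN_predicted_CDS/) {
        $header = $_;
        $seq = shift @seq;
        print OUT1 $header;
        print OUT1 $seq;
    }
    if ($_ =~ m/^\s*>/ && $_ =~ m/GENSCAN_predicted_peptide/) {
        $header = $_;
        $seq = shift @seq;
        print OUT2 $header;
        print OUT2 $seq;
    }
}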

The problem I am facing is that seqfile2_uc comes out exactly as I wanted, but seqfile1_lc does not. The sequences from the first file are printed correctly, but from the second file onwards other content from the result files is printed as well. Why is this happening? Can anyone please help? Thank you. My result files should look like this:

seqfile1_lc.out
>scaffold_2|GENSCAN_predicted_CDS_1|144_bp
atggatcggagtcatcagatatgggatgagaagctgatggatcagaagtgcgtcctctac
cccgtctgcctctggcacgagcaggagagcgagatgaactgggaggctgagaaagctgag
aggctaaggactgaacagaaaggn
>scaffold_2|GENSCAN_predicted_CDS_2|156_bp
atgaggaagaatgagggtgttgatgatgcagatccactaatagaaggtatgaaggagatc
actctgaaagattcagcaactccagaagacaaggatcaagaagacaaacaagatgagaga
ggtgaagaaattcaagaacaacctcaagagatgtag
... and so on, from all the 40,000 files.

seqfile2_uc.out
>scaffold_2|GENSCAN_predicted_peptide_1|48_aa
MDRSHQIWDEKLMDQKCVLYPVCLWHEQESEMNWEAEKAERLRTEQKX
>scaffold_2|GENSCAN_predicted_peptide_2|51_aa
MRKNEGVDDADPLIEGMKEITLKDSATPEDKDQEDKQDERGEEIQEQPQEM
... and so on, from all the 40,000 files.
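One possible reading of the symptom: in the program as posted, $seq is never cleared between files, and every line that is not a header is appended to it. In each result file the last record is a CDS, so the report text at the top of the next file (everything before its first > header) would be glued onto that pending CDS sequence. That would explain why only the CDS output is affected. Below is a minimal sketch of one way around it, not a definitive fix. The helper name split_records and the idea of passing the folder on the command line are assumptions for illustration; the output file names are the ones from the post. It also keeps the sequence lines wrapped as they are in the input, as in the desired output shown above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): takes the lines of
# ONE GENSCAN result file and returns the CDS records and the peptide
# records as two strings.
sub split_records {
    my @lines = @_;
    my ($cds, $pep) = ('', '');
    my $target;    # undef until the first '>' header of this file is seen
    for my $line (@lines) {
        if ($line =~ /^>/) {
            # Choose the destination from the header type.
            $target = $line =~ /GENSCAN_predicted_CDS/     ? \$cds
                    : $line =~ /GENSCAN_predicted_peptide/ ? \$pep
                    :                                        undef;
        }
        next unless defined $target;    # skip report text before the first header
        next unless $line =~ /\S/;      # skip blank lines
        $line =~ s/\s+\z//;             # normalise the line ending
        $$target .= "$line\n";
    }
    return ($cds, $pep);
}

# Assumption: the result folder is given as the first command-line argument.
my $dir = shift @ARGV;
if (defined $dir) {
    open(my $out1, '>', "seqfile1_lc.out") or die "Cannot write seqfile1_lc.out: $!";
    open(my $out2, '>', "seqfile2_uc.out") or die "Cannot write seqfile2_uc.out: $!";
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    while (defined(my $name = readdir($dh))) {
        my $path = "$dir/$name";
        next unless -f $path;           # skips '.', '..' and subdirectories
        open(my $in, '<', $path) or die "Cannot read $path: $!";
        # All parsing state lives inside split_records, so nothing is
        # carried over from one file to the next.
        my ($cds, $pep) = split_records(<$in>);
        close($in);
        print {$out1} $cds;
        print {$out2} $pep;
    }
    closedir($dh);
    close($out1);
    close($out2);
}
```

Saved under a name of your choice (say split_genscan.pl), it would be run as: perl split_genscan.pl /home/user/Desktop/resultfolder. Because each file is handled by its own call, only one file's lines are held in memory at a time, which may matter with 40,000 files.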

In reply to parsing sequences from a file by sugar
