Dear monks, I have a result file which looks like this:
GENSCAN 1.0 Date run: 11-Mar-109 Time: 00:29:37
Sequence scaffold_2 : 47583 bp : 37.42% C+G : Isochore 1 ( 0 - 100 C+G%)
Parameter matrix: test.smat
Predicted genes/exons:

Gn.Ex Type S .Begin ...End .Len Fr Ph I/Ac Do/T CodRg P.... Tscr..
----- ---- - ------ ------ ---- -- -- ---- ---- ----- ----- ------
 1.01 Init -   2604   2462  143  0  2   77   96   184 0.883  20.81
 1.00 Prom -   3677   3638   40                               2.82
 2.00 Prom +   4382   4421   40                               1.62
 2.01 Init +   6395   6399    5  1  2   73   84     0 0.122  -0.16
 2.02 Intr +   8270   8412  143  2  2   74   82   110 0.141  12.00
 2.03 Term +   8516   8523    8  0  2   46   46     0 0.473  -8.05
 2.04 PlyA +   9723   9728    6                               2.27

Predicted peptide sequence(s):
Predicted coding sequence(s):

>scaffold_2|GENSCAN_predicted_peptide_1|48_aa
MDRSHQIWDEKLMDQKCVLYPVCLWHEQESEMNWEAEKAERLRTEQKX
>scaffold_2|GENSCAN_predicted_CDS_1|144_bp
atggatcggagtcatcagatatgggatgagaagctgatggatcagaagtgcgtcctctac
cccgtctgcctctggcacgagcaggagagcgagatgaactgggaggctgagaaagctgag
aggctaaggactgaacagaaaggn
>scaffold_2|GENSCAN_predicted_peptide_2|51_aa
MRKNEGVDDADPLIEGMKEITLKDSATPEDKDQEDKQDERGEEIQEQPQEM
>scaffold_2|GENSCAN_predicted_CDS_2|156_bp
atgaggaagaatgagggtgttgatgatgcagatccactaatagaaggtatgaaggagatc
actctgaaagattcagcaactccagaagacaaggatcaagaagacaaacaagatgagaga
ggtgaagaaattcaagaacaacctcaagagatgtag
This is the content of one of the result files, and I have 40,000 files like this. From these files I have to copy only the sequence headers (the lines starting with >) and the sequences below them into two different files: one (seqfile1) for "GENSCAN_predicted_CDS" and one (seqfile2) for "GENSCAN_predicted_peptide". So all the lowercase sequences, with their headers, should go to seqfile1, and all the uppercase sequences, with their headers, should go to seqfile2. My program goes like this:
#!/usr/bin/perl
open(OUT1, ">seqfile1_lc.out");
open(OUT2, ">seqfile2_uc.out");
opendir(DIR, "/home/user/Desktop/resultfolder");
while (($filename = readdir(DIR))) {
    open(FILE, "/home/user/Desktop/resultfolder/$filename");
    while (<FILE>) {
        if ($_ =~ m/^>/ && $_ =~ m/GENSCAN_predicted_CDS/ || $_ =~ m/GENSCAN_predicted_peptide/) {
            push(@seq, $seq) if $seq;
            $seq = ();
            push(@seq, "\n$_");
        }
        else {
            $_ =~ s/\s*$//;
            $seq = $seq . "$_";
        }
    }
}
push(@seq, $seq);
while ($_ = shift @seq) {
    if ($_ =~ m/^>/ && $_ =~ m/GENSCAN_predicted_CDS/) {
        $header = $_;
        $seq = shift @seq;
        print OUT1 $header;
        print OUT1 $seq;
    }
    if ($_ =~ m/^>/ && $_ =~ m/GENSCAN_predicted_peptide/) {
        $header = $_;
        $seq = shift @seq;
        print OUT2 $header;
        print OUT2 $seq;
    }
}
The problem I am facing is that seqfile2_uc comes out exactly as I wanted, but seqfile1_lc does not. The sequences from the first file are printed correctly, but from the second file onwards other content from the result files is printed as well. Why is this happening? Can anyone please help? Thank you. My result files should look like this:
seqfile1_lc.out:

>scaffold_2|GENSCAN_predicted_CDS_1|144_bp
atggatcggagtcatcagatatgggatgagaagctgatggatcagaagtgcgtcctctac
cccgtctgcctctggcacgagcaggagagcgagatgaactgggaggctgagaaagctgag
aggctaaggactgaacagaaaggn
>scaffold_2|GENSCAN_predicted_CDS_2|156_bp
atgaggaagaatgagggtgttgatgatgcagatccactaatagaaggtatgaaggagatc
actctgaaagattcagcaactccagaagacaaggatcaagaagacaaacaagatgagaga
ggtgaagaaattcaagaacaacctcaagagatgtag
... and so on, from all the 40,000 files.

seqfile2_uc.out:

>scaffold_2|GENSCAN_predicted_peptide_1|48_aa
MDRSHQIWDEKLMDQKCVLYPVCLWHEQESEMNWEAEKAERLRTEQKX
>scaffold_2|GENSCAN_predicted_peptide_2|51_aa
MRKNEGVDDADPLIEGMKEITLKDSATPEDKDQEDKQDERGEEIQEQPQEM
... and so on, from all the 40,000 files.
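
A possible explanation, offered as a guess from reading the code rather than from running it: $seq is never reset between files, so every non-header line at the top of the next result file (the GENSCAN banner and the exon table) gets appended to the last sequence of the previous file. In each file the last record is a CDS, which would explain why only seqfile1 picks up extra content while seqfile2 stays clean. The sketch below is one way around that, not a drop-in fix. The sub name extract_records is made up for illustration, and it assumes that every sequence line consists only of letters (or *). It handles one file at a time, so nothing can carry over into the next file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes the lines of ONE GENSCAN result file and
# returns two array refs, the CDS records and the peptide records.
# All state lives inside the sub, so it starts fresh for every file.
sub extract_records {
    my (@lines) = @_;
    my (@cds, @pep);
    my $current;    # list that the record being read belongs to, if any

    for my $line (@lines) {
        $line =~ s/\s+$//;    # strip the newline and trailing blanks
        if ($line =~ /^>.*GENSCAN_predicted_(CDS|peptide)/) {
            # A header we want: start a new record in the matching list.
            $current = $1 eq 'CDS' ? \@cds : \@pep;
            push @$current, "$line\n";
        }
        elsif ($line =~ /^>/) {
            undef $current;    # some other header: stop collecting
        }
        elsif ($current && $line =~ /^[A-Za-z*]+$/) {
            # Assumption: sequence lines contain letters (or *) only,
            # so banner and table lines are never appended.
            $current->[-1] .= "$line\n";
        }
    }
    return (\@cds, \@pep);
}

# Runs only when a directory is given on the command line.
if (@ARGV) {
    my $dir = shift @ARGV;
    open(my $out_cds, '>', 'seqfile1_lc.out') or die "seqfile1_lc.out: $!";
    open(my $out_pep, '>', 'seqfile2_uc.out') or die "seqfile2_uc.out: $!";
    opendir(my $dh, $dir) or die "$dir: $!";
    for my $name (sort readdir $dh) {
        my $path = "$dir/$name";
        next unless -f $path;    # skips ".", ".." and subdirectories
        open(my $in, '<', $path) or die "$path: $!";
        my ($cds, $pep) = extract_records(<$in>);
        close $in;
        print {$out_cds} @$cds;
        print {$out_pep} @$pep;
    }
    closedir $dh;
    close $out_cds or die "seqfile1_lc.out: $!";
    close $out_pep or die "seqfile2_uc.out: $!";
}
```

Saved under a name of your choice, it would be called with the result folder as its only argument, for example /home/user/Desktop/resultfolder. It is best run from outside that folder so the two output files are not picked up as input. Unlike the original program, this keeps the line breaks inside each sequence, which matches the desired output shown above.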

In reply to parsing sequences from a file by sugar
