in reply to parsing question

See $INPUT_RECORD_SEPARATOR ($/) for how to read multiline records as single chunks, and tr// for a quick way to count characters.

#! perl -slw
use strict;

$/ = "\n>"; # set record separator

while( <DATA> ){
    # Split the record in two
    my( $label, $data ) = m[ (^ [^\n]+ ) \n ( .*$ ) ]sx;

    # Remove extraneous '>' characters
    $label =~ s[^>][];
    $data  =~ s[>$][]s;

    # remove newlines from the data
    $data =~ tr[\n][]d;

    # count the characters
    # my $count = $data =~ tr[ACGT][ACGT];
    # Corrected the omission of $data from this line.
    my $count = $data =~ tr[ACGT][ACGT];

    print "$label contains $count characters";
}

=output

P:\test>321523
gb|AE008689|:2073051-2073458, Atu4895 contains 408 characters
gb|AE008689|:c2074151-2073549, Atu4896 contains 603 characters
gb|AE008689|:c2074749-2074345, Atu4897 contains 405 characters

=cut

__DATA__
>gb|AE008689|:2073051-2073458, Atu4895
TTGGTCACATATTTTCGCTATTCGAAACCCAAATATCCCTGCGGAACACGTTTTATTGGAGCCGCCGTTT
TGCAGTTGCTCGGTGCTGGATTCTTGGCGTTCGTCTTTTGTCTGCTGGATGGGCTCACTGCAAAACCAAC
GATAATTCTCGGTCAGTTTGTATCATGCCTAGTGGGGAGCGTCGCCGGCTTTCATTTCGTGGCTTTTCGT
CGCCCAGGCACGGACGGCCAACTTTACCTTATCGCGACGTCGCTCTTGGCATTTGGGACCCATTATTGGC
TGGTGTCATATTCATTACCTGACCTTTTGCTAGCAAGATTGATTTCAGGATTTGGATCGGGTGTGGTTGT
TGCAGGAACTTTCCGACGTCGCTTTCTGGAAAATCCGGTAATTCCCTGCGTTCGATAG
>gb|AE008689|:c2074151-2073549, Atu4896
GTGTTTGGACCATATTTTCTGTGCAAACTAAACGATGACATAGGGCGATTTTTAGTGGCGGACAAATACA
GACTTCCCGAAGAGTTTTTTACCACTCGGTTTCTCGTTAGACGCATCGTACCCACAGACGCTGAAGCTAT
TTTCGAAGGGTGGAACACCGATCCCGAGGTGACGAAGTACCTGACGTGGAAACCCCACTCCGAGCTTGGC
CAGACACAGCGGGCGATTGAAGAAAATTATAGTGCGTGGAATGCAGGTACATCGTTTCCAGCTGTCATCT
GCCATCGCGAACGGCCACATGAACTAATCGGCCGTATTGATGCACGTCCGATGGGCCACAAGGTCTCTTA
CGGGTGGCTTGTCCGAAGAACCTGGTGGGGCCGGGGTGTTGCAAGCGAGGTCGTTCAACTCGCTGTAGAA
CACGCGTTATCGCATCCGCGCATCTTTCGCACCGAAGCATCCTGCGACGTTCTGAACACGGCGTCAGCAA
GAGTGATGGAAAAAGTAGGGATGACAAAGGAGGCCGTGCTTCGACGGTACCTTTTTCACCCCAATTTTTC
GAATATGCCGCGAGACGCCTTCCTGTATTCCAAGGTACGTTAA
>gb|AE008689|:c2074749-2074345, Atu4897
ATGAAACATACCATCGCAGTTCTCGGCCTGATCACCTTCTCCAGCCCGGCCTTCGCAGCATCGTGCGAGA
AAAACTTCACCGTCTCAGGCGTACCGATGGTCACGGCTGTCTCTTACAAATCCTTTCAGGAACTGCCGAA
AGCCAAAGCACCAGCTGTCCTTCAAAAGCTCGCCCAGGCCGTCGCGGCAGAAGGTTTTTCAGGTATCCAG
ATCAACAAGGCACTGTCGTCAATCGATGCCCATCAGGAAACCAGCGGAAGTGGCAGGATTCAGACGCTGC
GGGTTGTCGCCCGCCAGAAAGGCGCCGCTGTCCGGATCGATGCTGTCTTCAATATTCAGGCAGGACAGAT
CGCCGACAAAGACGTCATCCGCAAGGGCATCTGCGACATCATAAAAGGCGCGTAA

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
Timing (and a little luck) are everything!

Replies are listed 'Best First'.
Re: Re: parsing question
by flounder99 (Friar) on Jan 15, 2004 at 12:55 UTC
    You are including the first line when you are counting your characters so you are counting the two A's in
    gb|AE008689|:2073051-2073458, Atu4895
    so your counts are off by 2. If you strip off the first line you will get the right answer. You also don't need to strip off the newlines, since tr[ACGT][ACGT] will only count the "ACGT" characters.
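
A minimal sketch of both points, assuming a helper named count_bases (hypothetical, not part of the original script) and using only the first 20 bases of the first record, split over two lines for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: count only the characters A, C, G and T.
# With an empty replacement list and no /d, tr/// leaves the string
# unchanged and returns the number of characters it matched.
sub count_bases {
    my ($text) = @_;
    return ( $text =~ tr/ACGT// );
}

my $label = 'gb|AE008689|:2073051-2073458, Atu4895';

# First 20 bases of the first record, split over two lines (illustrative only).
my $data = "TTGGTCACAT\nATTTTCGCTA\n";

# Newlines are never counted, so there is no need to delete them first.
print 'data only:    ', count_bases($data), "\n";

# Counting the whole record also picks up the two A's in the label.
print 'whole record: ', count_bases("$label\n$data"), "\n";
```

The second figure should come out 2 higher than the first, which is the size of the error described above.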

    --

    flounder