Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks, I am trying to learn perl and am starting by trying to parse files. Anyway, I simply want to count the number of characters in each gene in a file but am getting stuck. I wondered if someone may be able to help? The file looks like this: I want to count the sequence characters seperated by each >gb|..... line. Many thanks
>gb|AE008689|:2073051-2073458, Atu4895 TTGGTCACATATTTTCGCTATTCGAAACCCAAATATCCCTGCGGAACACGTTTTATTGGAGCCGCCGTTT TGCAGTTGCTCGGTGCTGGATTCTTGGCGTTCGTCTTTTGTCTGCTGGATGGGCTCACTGCAAAACCAAC GATAATTCTCGGTCAGTTTGTATCATGCCTAGTGGGGAGCGTCGCCGGCTTTCATTTCGTGGCTTTTCGT CGCCCAGGCACGGACGGCCAACTTTACCTTATCGCGACGTCGCTCTTGGCATTTGGGACCCATTATTGGC TGGTGTCATATTCATTACCTGACCTTTTGCTAGCAAGATTGATTTCAGGATTTGGATCGGGTGTGGTTGT TGCAGGAACTTTCCGACGTCGCTTTCTGGAAAATCCGGTAATTCCCTGCGTTCGATAG >gb|AE008689|:c2074151-2073549, Atu4896 GTGTTTGGACCATATTTTCTGTGCAAACTAAACGATGACATAGGGCGATTTTTAGTGGCGGACAAATACA GACTTCCCGAAGAGTTTTTTACCACTCGGTTTCTCGTTAGACGCATCGTACCCACAGACGCTGAAGCTAT TTTCGAAGGGTGGAACACCGATCCCGAGGTGACGAAGTACCTGACGTGGAAACCCCACTCCGAGCTTGGC CAGACACAGCGGGCGATTGAAGAAAATTATAGTGCGTGGAATGCAGGTACATCGTTTCCAGCTGTCATCT GCCATCGCGAACGGCCACATGAACTAATCGGCCGTATTGATGCACGTCCGATGGGCCACAAGGTCTCTTA CGGGTGGCTTGTCCGAAGAACCTGGTGGGGCCGGGGTGTTGCAAGCGAGGTCGTTCAACTCGCTGTAGAA CACGCGTTATCGCATCCGCGCATCTTTCGCACCGAAGCATCCTGCGACGTTCTGAACACGGCGTCAGCAA GAGTGATGGAAAAAGTAGGGATGACAAAGGAGGCCGTGCTTCGACGGTACCTTTTTCACCCCAATTTTTC GAATATGCCGCGAGACGCCTTCCTGTATTCCAAGGTACGTTAA >gb|AE008689|:c2074749-2074345, Atu4897 ATGAAACATACCATCGCAGTTCTCGGCCTGATCACCTTCTCCAGCCCGGCCTTCGCAGCATCGTGCGAGA AAAACTTCACCGTCTCAGGCGTACCGATGGTCACGGCTGTCTCTTACAAATCCTTTCAGGAACTGCCGAA AGCCAAAGCACCAGCTGTCCTTCAAAAGCTCGCCCAGGCCGTCGCGGCAGAAGGTTTTTCAGGTATCCAG ATCAACAAGGCACTGTCGTCAATCGATGCCCATCAGGAAACCAGCGGAAGTGGCAGGATTCAGACGCTGC GGGTTGTCGCCCGCCAGAAAGGCGCCGCTGTCCGGATCGATGCTGTCTTCAATATTCAGGCAGGACAGAT CGCCGACAAAGACGTCATCCGCAAGGGCATCTGCGACATCATAAAAGGCGCGTAA
and this is where i've got to with the code...
#! /usr/bin/perl -w use strict; my $num_of_params; $num_of_params = @ARGV; if ($num_of_params < 2) { die ("\nYou haven't entered enough parameters!\n\nTo run the progr +am type:\n\t\tperl gc3s_content.pl infile.ffn outfile.gc3s \n\n\t\tWh +ere window_size is the size of each window for which the program will + calculate results\n"); } open (INFILE, $ARGV[0]) or die "unable to open file $!"; open (OUTFILE,">$ARGV[1]") or die "unable to open file $!"; my @fasta = <INFILE>; my @gene; my $gene; my $flag; foreach my $line (@fasta) { if ($line =~ /^>/) { # print "$line\n"; $flag = 0; } # elsif ($line !~ /^>/) { elsif ($line =~ /^\w+/) { $gene = $line; $flag = 1; } if ($flag) { chomp ($gene); $gene .= $line; } } print "$gene\n";

Replies are listed 'Best First'.
Re: parsing question
by BrowserUk (Patriarch) on Jan 15, 2004 at 12:45 UTC

    See $INPUT_RECORD_SEPARATOR for how to read multiline records as single chunks and tr// for a quick way to count characters.

    #! perl -slw use strict; $/ = "\n>"; # set record separator while( <DATA> ){ # Split the record in two my( $label, $data ) = m[ (^ [^\n]+ ) \n ( .*$ ) ]sx; # Remove extraneous '>' characters $label =~ s[^>][]; $data =~ s[>$][]s; # remove newlines from the data $data =~ tr[\n][]d; # count the characters # my $count = $data =~ tr[ACGT][ACGT]; # Corrected the omission of + $data from this line. my $count = $data =~ tr[ACGT][ACGT]; print "$label contains $count characters"; } =output P:\test>321523 gb|AE008689|:2073051-2073458, Atu4895 contains 408 characters gb|AE008689|:c2074151-2073549, Atu4896 contains 603 characters gb|AE008689|:c2074749-2074345, Atu4897 contains 405 characters =cut __DATA__ >gb|AE008689|:2073051-2073458, Atu4895 TTGGTCACATATTTTCGCTATTCGAAACCCAAATATCCCTGCGGAACACGTTTTATTGGAGCCGCCGTTT TGCAGTTGCTCGGTGCTGGATTCTTGGCGTTCGTCTTTTGTCTGCTGGATGGGCTCACTGCAAAACCAAC GATAATTCTCGGTCAGTTTGTATCATGCCTAGTGGGGAGCGTCGCCGGCTTTCATTTCGTGGCTTTTCGT CGCCCAGGCACGGACGGCCAACTTTACCTTATCGCGACGTCGCTCTTGGCATTTGGGACCCATTATTGGC TGGTGTCATATTCATTACCTGACCTTTTGCTAGCAAGATTGATTTCAGGATTTGGATCGGGTGTGGTTGT TGCAGGAACTTTCCGACGTCGCTTTCTGGAAAATCCGGTAATTCCCTGCGTTCGATAG >gb|AE008689|:c2074151-2073549, Atu4896 GTGTTTGGACCATATTTTCTGTGCAAACTAAACGATGACATAGGGCGATTTTTAGTGGCGGACAAATACA GACTTCCCGAAGAGTTTTTTACCACTCGGTTTCTCGTTAGACGCATCGTACCCACAGACGCTGAAGCTAT TTTCGAAGGGTGGAACACCGATCCCGAGGTGACGAAGTACCTGACGTGGAAACCCCACTCCGAGCTTGGC CAGACACAGCGGGCGATTGAAGAAAATTATAGTGCGTGGAATGCAGGTACATCGTTTCCAGCTGTCATCT GCCATCGCGAACGGCCACATGAACTAATCGGCCGTATTGATGCACGTCCGATGGGCCACAAGGTCTCTTA CGGGTGGCTTGTCCGAAGAACCTGGTGGGGCCGGGGTGTTGCAAGCGAGGTCGTTCAACTCGCTGTAGAA CACGCGTTATCGCATCCGCGCATCTTTCGCACCGAAGCATCCTGCGACGTTCTGAACACGGCGTCAGCAA GAGTGATGGAAAAAGTAGGGATGACAAAGGAGGCCGTGCTTCGACGGTACCTTTTTCACCCCAATTTTTC GAATATGCCGCGAGACGCCTTCCTGTATTCCAAGGTACGTTAA >gb|AE008689|:c2074749-2074345, Atu4897 ATGAAACATACCATCGCAGTTCTCGGCCTGATCACCTTCTCCAGCCCGGCCTTCGCAGCATCGTGCGAGA AAAACTTCACCGTCTCAGGCGTACCGATGGTCACGGCTGTCTCTTACAAATCCTTTCAGGAACTGCCGAA AGCCAAAGCACCAGCTGTCCTTCAAAAGCTCGCCCAGGCCGTCGCGGCAGAAGGTTTTTCAGGTATCCAG ATCAACAAGGCACTGTCGTCAATCGATGCCCATCAGGAAACCAGCGGAAGTGGCAGGATTCAGACGCTGC GGGTTGTCGCCCGCCAGAAAGGCGCCGCTGTCCGGATCGATGCTGTCTTCAATATTCAGGCAGGACAGAT CGCCGACAAAGACGTCATCCGCAAGGGCATCTGCGACATCATAAAAGGCGCGTAA

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    Timing (and a little luck) are everything!

      You are including the first line when you are counting your characters so you are counting the two A's in
      gb|AE008689|:2073051-2073458, Atu4895
      so your counts are off by 2. If you strip off the first line you will get the right answer. You also don't need the strip off the newlines since tr[ACGT][ACGT] will only count the "ACGT" characters.

      --

      flounder

Re: parsing question
by Roger (Parson) on Jan 15, 2004 at 12:49 UTC
    An alternative solution.

    use strict; use warnings; use Data::Dumper; my %seq_count; local $/ = '>gb|'; while (my $seq = <DATA>) { $seq =~ s/>gb\|//; next if ! length($seq); # get sequence id my ($seq_id) = $seq =~ /\|:(.*)/; # remove anything other than sequence characters $seq =~ s/^.*\n|\n//g; # build hash of sequence count $seq_count{$seq_id} = length($seq); } print Dumper(\%seq_count); __DATA__ >gb|AE008689|:2073051-2073458, Atu4895 TTGGTCACATATTTTCGCTATTCGAAACCCAAATATCCCTGCGGAACACGTTTTATTGGAGCCGCCGTTT TGCAGTTGCTCGGTGCTGGATTCTTGGCGTTCGTCTTTTGTCTGCTGGATGGGCTCACTGCAAAACCAAC GATAATTCTCGGTCAGTTTGTATCATGCCTAGTGGGGAGCGTCGCCGGCTTTCATTTCGTGGCTTTTCGT CGCCCAGGCACGGACGGCCAACTTTACCTTATCGCGACGTCGCTCTTGGCATTTGGGACCCATTATTGGC TGGTGTCATATTCATTACCTGACCTTTTGCTAGCAAGATTGATTTCAGGATTTGGATCGGGTGTGGTTGT TGCAGGAACTTTCCGACGTCGCTTTCTGGAAAATCCGGTAATTCCCTGCGTTCGATAG >gb|AE008689|:c2074151-2073549, Atu4896 GTGTTTGGACCATATTTTCTGTGCAAACTAAACGATGACATAGGGCGATTTTTAGTGGCGGACAAATACA GACTTCCCGAAGAGTTTTTTACCACTCGGTTTCTCGTTAGACGCATCGTACCCACAGACGCTGAAGCTAT TTTCGAAGGGTGGAACACCGATCCCGAGGTGACGAAGTACCTGACGTGGAAACCCCACTCCGAGCTTGGC CAGACACAGCGGGCGATTGAAGAAAATTATAGTGCGTGGAATGCAGGTACATCGTTTCCAGCTGTCATCT GCCATCGCGAACGGCCACATGAACTAATCGGCCGTATTGATGCACGTCCGATGGGCCACAAGGTCTCTTA CGGGTGGCTTGTCCGAAGAACCTGGTGGGGCCGGGGTGTTGCAAGCGAGGTCGTTCAACTCGCTGTAGAA CACGCGTTATCGCATCCGCGCATCTTTCGCACCGAAGCATCCTGCGACGTTCTGAACACGGCGTCAGCAA GAGTGATGGAAAAAGTAGGGATGACAAAGGAGGCCGTGCTTCGACGGTACCTTTTTCACCCCAATTTTTC GAATATGCCGCGAGACGCCTTCCTGTATTCCAAGGTACGTTAA >gb|AE008689|:c2074749-2074345, Atu4897 ATGAAACATACCATCGCAGTTCTCGGCCTGATCACCTTCTCCAGCCCGGCCTTCGCAGCATCGTGCGAGA AAAACTTCACCGTCTCAGGCGTACCGATGGTCACGGCTGTCTCTTACAAATCCTTTCAGGAACTGCCGAA AGCCAAAGCACCAGCTGTCCTTCAAAAGCTCGCCCAGGCCGTCGCGGCAGAAGGTTTTTCAGGTATCCAG ATCAACAAGGCACTGTCGTCAATCGATGCCCATCAGGAAACCAGCGGAAGTGGCAGGATTCAGACGCTGC GGGTTGTCGCCCGCCAGAAAGGCGCCGCTGTCCGGATCGATGCTGTCTTCAATATTCAGGCAGGACAGAT CGCCGACAAAGACGTCATCCGCAAGGGCATCTGCGACATCATAAAAGGCGCGTAA
    And the output -
    $VAR1 = { 'c2074151-2073549, Atu4896' => 603, '2073051-2073458, Atu4895' => 408, 'c2074749-2074345, Atu4897' => 405 };
Re: parsing question
by chimni (Pilgrim) on Jan 15, 2004 at 12:28 UTC

    I know this can be done in one line with a regex and someone will probably blow my solution out of the water)but
    my $x=">gb|AE004895>gb|TTGGT>gb|"; my @gene= split(/>gb\|/,$x); foreach my $tmp (@gene) { if($tmp =~ m/\w+/) { print "Length"."=".length($tmp)."\n"; } }
    use split on the ">gb|" delimeter ."|" is a special char so escape and use length
    HTH
Re: parsing question
by flounder99 (Friar) on Jan 15, 2004 at 12:46 UTC
    You can use length to get the length of a string. You just need to make sure you get rid of the newlines or you could use tr/ACGT/ACGT/

    This might point you in the right direction:

    use strict; use warnings; my ($chars, $gene); for (<DATA>) { chomp; #get rid of newline if (/^\>/) { print "$gene => $chars chars\n" if $chars; $chars=0; $gene = $_; } else { $chars += length; #$chars += tr/AGCT/AGCT/; #would work also } } print "$gene => $chars chars\n" if $chars; #print the last gene at the + end of the file __DATA__ >gb|AE008689|:2073051-2073458, Atu4895 TTGGTCACATATTTTCGCTATTCGAAACCCAAATATCCCTGCGGAACACGTTTTATTGGAGCCGCCGTTT TGCAGTTGCTCGGTGCTGGATTCTTGGCGTTCGTCTTTTGTCTGCTGGATGGGCTCACTGCAAAACCAAC GATAATTCTCGGTCAGTTTGTATCATGCCTAGTGGGGAGCGTCGCCGGCTTTCATTTCGTGGCTTTTCGT CGCCCAGGCACGGACGGCCAACTTTACCTTATCGCGACGTCGCTCTTGGCATTTGGGACCCATTATTGGC TGGTGTCATATTCATTACCTGACCTTTTGCTAGCAAGATTGATTTCAGGATTTGGATCGGGTGTGGTTGT TGCAGGAACTTTCCGACGTCGCTTTCTGGAAAATCCGGTAATTCCCTGCGTTCGATAG >gb|AE008689|:c2074151-2073549, Atu4896 GTGTTTGGACCATATTTTCTGTGCAAACTAAACGATGACATAGGGCGATTTTTAGTGGCGGACAAATACA GACTTCCCGAAGAGTTTTTTACCACTCGGTTTCTCGTTAGACGCATCGTACCCACAGACGCTGAAGCTAT TTTCGAAGGGTGGAACACCGATCCCGAGGTGACGAAGTACCTGACGTGGAAACCCCACTCCGAGCTTGGC CAGACACAGCGGGCGATTGAAGAAAATTATAGTGCGTGGAATGCAGGTACATCGTTTCCAGCTGTCATCT GCCATCGCGAACGGCCACATGAACTAATCGGCCGTATTGATGCACGTCCGATGGGCCACAAGGTCTCTTA CGGGTGGCTTGTCCGAAGAACCTGGTGGGGCCGGGGTGTTGCAAGCGAGGTCGTTCAACTCGCTGTAGAA CACGCGTTATCGCATCCGCGCATCTTTCGCACCGAAGCATCCTGCGACGTTCTGAACACGGCGTCAGCAA GAGTGATGGAAAAAGTAGGGATGACAAAGGAGGCCGTGCTTCGACGGTACCTTTTTCACCCCAATTTTTC GAATATGCCGCGAGACGCCTTCCTGTATTCCAAGGTACGTTAA >gb|AE008689|:c2074749-2074345, Atu4897 ATGAAACATACCATCGCAGTTCTCGGCCTGATCACCTTCTCCAGCCCGGCCTTCGCAGCATCGTGCGAGA AAAACTTCACCGTCTCAGGCGTACCGATGGTCACGGCTGTCTCTTACAAATCCTTTCAGGAACTGCCGAA AGCCAAAGCACCAGCTGTCCTTCAAAAGCTCGCCCAGGCCGTCGCGGCAGAAGGTTTTTCAGGTATCCAG ATCAACAAGGCACTGTCGTCAATCGATGCCCATCAGGAAACCAGCGGAAGTGGCAGGATTCAGACGCTGC GGGTTGTCGCCCGCCAGAAAGGCGCCGCTGTCCGGATCGATGCTGTCTTCAATATTCAGGCAGGACAGAT CGCCGACAAAGACGTCATCCGCAAGGGCATCTGCGACATCATAAAAGGCGCGTAA
    outputs
    >gb|AE008689|:2073051-2073458, Atu4895 => 408 chars >gb|AE008689|:c2074151-2073549, Atu4896 => 603 chars >gb|AE008689|:c2074749-2074345, Atu4897 => 405 chars

    --

    flounder

Re: parsing question
by alexg (Beadle) on Jan 15, 2004 at 14:44 UTC
    If you want to do bioinformatics with Perl, may I recommend getting familiar with Bioperl (http://bioperl.org/). It provides modules for FASTA format parsing (and many other formats. Your homework is then very easy :)
    use Bio::SeqIO; $in = Bio::SeqIO->new(-file => "inputfilename",-format => 'Fasta'); while ( my $seq = $in->next_seq() ) { $seqstr = $seq->seq(); # actual sequence as a string print $seqobj->display_id(); # human readable id of the sequence print length($seqstr); # length of sequence }
    Code is untested and largely lifted from the Bioperl tutorial.
Re: parsing question
by Hena (Friar) on Jan 15, 2004 at 12:50 UTC
    UPDATE:
    I actually read the problem wrong, and that was answered by a lot of people by now... so ignore this unless you want to count the nucleotide amounts :)

    Heres a code, that requires the user to give a filename in commandline. You should be able to use the sub within your own code as well.
    #!/usr/bin/perl use warnings; use strict; my $inf=shift @ARGV || or die "Need to give a input file.\n"; open (INF,shift @ARGV); sub print_map (\$\$) { my $name_r=shift @_; my $code_r=shift @_; my %hash; print "$$name_r\n"; while ($$code_r=~m/(\w)/g) { $hash{$1}++;} foreach (sort {$a cmp $b} keys %hash) { print "$_: $hash{$_}\n"; } } my ($code,$name); while (<INF>) { chomp; if (m/^>(.+)/) { print_map ($name,$code) if ($name); $name=$1; } else { $code.=uc($_); } } print_map($name,$code); exit;
Re: parsing question
by mr_mischief (Monsignor) on Jan 15, 2004 at 15:03 UTC
    my $num_chars = 0; <>; while ( <> ) { do {print "$num_chars\n"; $num_chars = 0; next} if /^>/; $num_chars += tr/ACGT/ACGT/; } print "$num_chars\n";
    or
    my @num_chars = (); my $i = -1; while ( <> ) { $i++, next if /^>/; $num_chars[$i] += tr/ACGT/ACGT/; } foreach ( @num_chars ) { print "$_\n"; }
    both seem to work for me. That is, for each group of lines separated by lines startting with '>', they each accurately print the number of [ACGT] which appear in that group.

    The first gets by with one declared scalar and one loop. The second requires more memory since it uses an array instead of a single scalar. That array is useful, though, if you want to do anything with the values other than just print them. The second also uses two loops instead of one, which could be a speed factor on large data sets. The second loop could be removed easily enough.

    On the loop fixups:
    Since someone (not here on PerlMonks) recently asked me why I tend to use loop fixups instead of additional tests within a loop... The first has a pre-loop fixup involving input and a post-loop fixup involving output. The second initializes the array index to -1 instead of to zero before the loop. One might wonder why I used these fixups since they may not be quite as clear to some as an additional test inside the loop. The loops are cluttered enough as it is, and sometimes visual clarity makes up for logical clarity. I find them simpler to debug when compared to something which makes an additional flow control changes, especially inside a loop. This way the loops stay more consistent and focus on the general case in which we're interested. I find this better than having the special-case code folded into the general-case code when I try to analyze a program's behavior. Also, by making these simple and easily understood actions happen outside the loop, you could see notable performance differences on large data sets. If the printing and resetting of the counter from the first example get more complex, they could be put into a single subroutine, and that subroutine could be called both from within the loop and as the last output after the loop, making the final output a single entity to debug along with the others. So it's not just optimization. I actually find it easier to do things this way. YMMV.



    Christopher E. Stith