Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks, I have many records in a file, and i want to iterate through the file, and do something with each record. e.g. my file looks like this:
>gb|AE008687|:70-1377, Atu5000 ATGTCTGGACGTAAAGCGAGAATCATGTTGTATCTTTGGCGGGCTTTGGGCGGCAAACCGAACCTTGCCC +GCCAAGGGGGCGATATGGCGATAGCGAAGCAGATTGAAGCAACGATCGGTCAAAAGGAAGATGCAGGTG +G >gb|AE008687|:1374-2405, Atu5001 ATGACCAGTAAGTCATCGCGTAAATCCATCGTTGCAAATTTCGGACTGCTGTCGGCGGAGCTTGAAAACC
For example, I want to calculate the number of A's in each record. I thought maybe the input record seperator may help, but i'm not sure how to use it. I have written a piece of code but am not sure how to treat each record individually. Sorry for such a low-level question.
my $in_sequence = 0; my @gene; my $dna; while (<INFILE>) { my $line = $_; if ($line =~ /^>/) { # print $line; $in_sequence = 1; } elsif ($in_sequence) { $dna .= $line; } } print $dna;
AM

Replies are listed 'Best First'.
Re: input record separator help
by nite_man (Deacon) on Mar 15, 2004 at 13:52 UTC

    I've been making small modification of your code:

    while (<INFILE>) { if (/^>/) { $in_sequence = 1; next; } $dna .= $_ if $in_sequence; } }

    Also, I'd like to suggest you look at site of BioPerl if you haven't seen yet it ;)

    ~ Schiller

      thanks for your help Schiller, but I dont get how I can perform my various calculations on each sequence using your code. How can I push each extracted $dna into an array, which I can then iterate through for the calculations? This is the sort of thing I want to do for each sequence:
      for (my $i=0; $i < @genome; $i++) { my $token = $genome[$i] . $genome[$i+1]; $freq{$token}++; $tt = $freq{tt}; $cg = $freq{cg}; $gc = $freq{gc}; $ta = $freq{ta}; $at = $freq{at}; $cc = $freq{cc}; $tg = $freq{tg}; $ag = $freq{ag}; $ac = $freq{ac}; $ga = $freq{ga}; $aa = $freq{aa}; $gg = $freq{gg}; $ca = $freq{ca}; $ct = $freq{ct}; $gt = $freq{gt}; $tc = $freq{tc}; }
      I have tried just using a simple 'push', but this slowed the program down far too much. cheers x
        Anonymous Monk,
        So my earlier assumption that you wanted to throw away non sequence lines was correct. I believe you can avoid an array all together.
        #!/usr/bin/perl use strict; use warnings; my $file = $ARGV[0] || 'default'; open (DNA, '<', $file) or die "Unable to open $file for reading : $!"; my %seq; while ( <DNA> ) { my $dna = <DNA>; my $pairs = "a2" x ((length $dna) / 2); $seq{$_}++ for unpack $pairs , $dna; } print "$_ : $seq{$_}\n" for keys %seq;
        Cheers - L~R
Re: input record separator help
by Limbic~Region (Chancellor) on Mar 15, 2004 at 13:47 UTC
    Anonymous Monk,
    Unfortunately, the input record seperator $/ can not be a regex (at least not in Perl5). It appears that your records are two lines where the second line is what you actually care about. Try something like this:
    while ( <INFILE> ) { my $record = <INFILE>; # stuff goes here }
    This way, each time through the loop $_ contains your information line of the record and $record contains the DNA sequence.

    Cheers - L~R

Re: input record separator help
by helgi (Hermit) on Mar 15, 2004 at 14:49 UTC
    It's probably not a good idea to reinvent the wheel every time you need to program a trivially simple bioinformatics problem that thousands have wrestled with before you.

    Download, install and use Bioperl instead.

    Bioperl on CPAN.

    Bioperl.org


    --
    Regards,
    Helgi Briem
    hbriem AT simnet DOT is

Re: input record separator help
by tinita (Parson) on Mar 15, 2004 at 13:58 UTC
    well, if you always have the > at the beginning, you could do something like that:
    my $dna; { local $/ = "\n>"; $dna = <DATA>; $dna =~ s/^>//; while (<DATA>) { chomp; $dna .= $_; } } chomp $dna;
    update: ok, i forgot you don't want the line starting with > at all. then the approach with reading every second line might be the best, depending on your specification. my example shows you, however, how you can use $/
Re: input record separator help
by Happy-the-monk (Canon) on Mar 15, 2004 at 14:36 UTC

    just for counting the number of occurences of the letter "A" I would leave the input record separator alone. Try:

    $count_A = ( $line =~ tr /A// , "\n" )

    Sören