input record separator help

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks, I have many records in a file, and i want to iterate through the file, and do something with each record. e.g. my file looks like this:

>gb|AE008687|:70-1377, Atu5000
ATGTCTGGACGTAAAGCGAGAATCATGTTGTATCTTTGGCGGGCTTTGGGCGGCAAACCGAACCTTGCCC
+GCCAAGGGGGCGATATGGCGATAGCGAAGCAGATTGAAGCAACGATCGGTCAAAAGGAAGATGCAGGTG
+G
>gb|AE008687|:1374-2405, Atu5001
ATGACCAGTAAGTCATCGCGTAAATCCATCGTTGCAAATTTCGGACTGCTGTCGGCGGAGCTTGAAAACC
[download]

For example, I want to calculate the number of A's in each record. I thought maybe the input record seperator may help, but i'm not sure how to use it. I have written a piece of code but am not sure how to treat each record individually. Sorry for such a low-level question.

my $in_sequence = 0;
my @gene;
my $dna;

while (<INFILE>)   {

       my $line = $_;
       if ($line =~ /^>/)   {
            # print $line;
            $in_sequence = 1;
      }

      elsif ($in_sequence)  {
        $dna .= $line;
              }
}

print $dna;
[download]

Comment on input record separator help Select or Download Code

Replies are listed 'Best First'.
Re: input record separator help by nite_man (Deacon) on Mar 15, 2004 at 13:52 UTC
I've been making small modification of your code: `while (<INFILE>) { if (/^>/) { $in_sequence = 1; next; } $dna .= $_ if $in_sequence; } }` [download] Also, I'd like to suggest you look at site of BioPerl if you haven't seen yet it ;) ~ Schiller	[reply] [d/l]
Re: Re: input record separator help by Anonymous Monk on Mar 15, 2004 at 14:29 UTC
thanks for your help Schiller, but I dont get how I can perform my various calculations on each sequence using your code. How can I push each extracted $dna into an array, which I can then iterate through for the calculations? This is the sort of thing I want to do for each sequence: `for (my $i=0; $i < @genome; $i++) { my $token = $genome[$i] . $genome[$i+1]; $freq{$token}++; $tt = $freq{tt}; $cg = $freq{cg}; $gc = $freq{gc}; $ta = $freq{ta}; $at = $freq{at}; $cc = $freq{cc}; $tg = $freq{tg}; $ag = $freq{ag}; $ac = $freq{ac}; $ga = $freq{ga}; $aa = $freq{aa}; $gg = $freq{gg}; $ca = $freq{ca}; $ct = $freq{ct}; $gt = $freq{gt}; $tc = $freq{tc}; }` [download] I have tried just using a simple 'push', but this slowed the program down far too much. cheers x	[reply] [d/l]
Re: Re: Re: input record separator help by Limbic~Region (Chancellor) on Mar 15, 2004 at 14:53 UTC
Anonymous Monk, So my earlier assumption that you wanted to throw away non sequence lines was correct. I believe you can avoid an array all together. `#!/usr/bin/perl use strict; use warnings; my $file = $ARGV[0] \|\| 'default'; open (DNA, '<', $file) or die "Unable to open $file for reading : $!"; my %seq; while ( <DNA> ) { my $dna = <DNA>; my $pairs = "a2" x ((length $dna) / 2); $seq{$_}++ for unpack $pairs , $dna; } print "$_ : $seq{$_}\n" for keys %seq;` [download] Cheers - L~R	[reply] [d/l]
Re: Re: Re: Re: input record separator help by Anonymous Monk on Mar 15, 2004 at 15:00 UTC
Re: Re: Re: Re: Re: input record separator help by Limbic~Region (Chancellor) on Mar 15, 2004 at 15:11 UTC
Re: input record separator help by Limbic~Region (Chancellor) on Mar 15, 2004 at 13:47 UTC
Anonymous Monk, Unfortunately, the input record seperator $/ can not be a regex (at least not in Perl5). It appears that your records are two lines where the second line is what you actually care about. Try something like this: `while ( <INFILE> ) { my $record = <INFILE>; # stuff goes here }` [download] This way, each time through the loop $_ contains your information line of the record and $record contains the DNA sequence. Cheers - L~R	[reply] [d/l]
Re: input record separator help by helgi (Hermit) on Mar 15, 2004 at 14:49 UTC
It's probably not a good idea to reinvent the wheel every time you need to program a trivially simple bioinformatics problem that thousands have wrestled with before you. Download, install and use Bioperl instead. Bioperl on CPAN. Bioperl.org -- Regards, Helgi Briem hbriem AT simnet DOT is	[reply]
Re: input record separator help by tinita (Parson) on Mar 15, 2004 at 13:58 UTC
well, if you always have the > at the beginning, you could do something like that: `my $dna; { local $/ = "\n>"; $dna = <DATA>; $dna =~ s/^>//; while (<DATA>) { chomp; $dna .= $_; } } chomp $dna;` [download] update: ok, i forgot you don't want the line starting with > at all. then the approach with reading every second line might be the best, depending on your specification. my example shows you, however, how you can use $/	[reply] [d/l]
Re: input record separator help by Happy-the-monk (Canon) on Mar 15, 2004 at 14:36 UTC
just for counting the number of occurences of the letter "A" I would leave the input record separator alone. Try: `$count_A = ( $line =~ tr /A// , "\n" )` Sören	[reply] [d/l]