in reply to Extracting field information from a GenBank file.

Hi all.

Thank you both for the help. To help others who might find this node, I have included my working code below. It assumes the GenBank file format does not change, although given the sheer power of Perl, modifications should be easy to make:

#!/usr/bin/perl -w
use strict;
use autodie;

my $dna;
open FH, '<', 'fluseq.txt';
my $data = do { local $/; <FH> };
if ($data =~ /ORIGIN(.*)/s) {
    $dna = $1;
    $dna =~ s/\s+//g;
    $dna =~ s/\d+//g;
    $dna =~ s/\/\///;
}
print $dna;
close FH;

Re^2: Extracting field information from a GenBank file.
by hdb (Monsignor) on Jul 15, 2013 at 08:43 UTC

    A couple of comments on your code:

    • If the word 'ORIGIN' appears anywhere else in the text, your code would break.
    • If you use $/='ORIGIN', your file would automatically be split at all occurrences of this word, and you could just use the last bit.
    • Instead of removing all kinds of unwanted characters you could tell Perl to remove everything but a, c, g, and t.
    Most of it is clearly a matter of taste but it feels more direct to me this way:

    use strict;
    use autodie;

    open my $fh, '<', 'fluseq.txt';
    my @tmp = do { local $/ = 'ORIGIN'; <$fh> };
    my $dna = pop @tmp;
    $dna =~ s/[^acgt]//gi;    # delete all but a, c, g, and t
    print $dna;
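
To make the first two comments concrete, here is a minimal sketch that runs both approaches on the same input. The record is a made-up, heavily abbreviated GenBank-style entry (not real data) in which the word 'ORIGIN' also appears in a COMMENT line, and it is read through an in-memory filehandle instead of fluseq.txt so the example is self-contained. The variable names are only illustrative.

```perl
use strict;
use warnings;

# Hypothetical, abbreviated GenBank-style record (assumption, not real data).
# 'ORIGIN' deliberately appears in the COMMENT line as well.
my $record = join "\n",
    'LOCUS       SAMPLE01       24 bp    DNA',
    'COMMENT     ORIGIN of this isolate is unknown.',
    'ORIGIN',
    '        1 atgcatgcat gcatgcatgc atgc',
    '//',
    '';

# Approach 1: capture everything after the FIRST 'ORIGIN' in the record.
my ($from_regex) = $record =~ /ORIGIN(.*)/s;
$from_regex =~ s/[^acgt]//gi;    # keep only a, c, g, t (either case)

# Approach 2: read 'ORIGIN'-terminated chunks and keep only the last one.
open my $fh, '<', \$record or die "Cannot open in-memory file: $!";
my @chunks = do { local $/ = 'ORIGIN'; <$fh> };
close $fh;
my $from_split = pop @chunks;
$from_split =~ s/[^acgt]//gi;    # keep only a, c, g, t (either case)

print "regex: ", length($from_regex), " characters\n";
print "split: ", length($from_split), " characters\n";
```

With this sample, the first approach starts capturing at the COMMENT line, so a few stray a/c/g/t letters from the comment text (and from the second 'ORIGIN') end up in front of the sequence, while the last chunk of the second approach contains only the sequence block. Note that both versions would also strip ambiguity codes such as n, so the character class would need widening if those must be preserved.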