Extract the matching strings

sundeep has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am posting the contents of my source file and the intermediate code I have written.

This is the source file content:

LOCUS       YP_001648463             258 aa            linear   INV 17-JUN-2009
DEFINITION  cytochrome c oxidase subunit II [Ephydatia muelleri].
ACCESSION   YP_001648463
VERSION     YP_001648463.1  GI:164420795
DBLINK      Project: 28177
DBSOURCE    REFSEQ: accession NC_010202.1
KEYWORDS    .
SOURCE      mitochondrion Ephydatia muelleri
  ORGANISM  Ephydatia muelleri
            Eukaryota; Metazoa; Porifera; Demospongiae; Ceractinomorpha;
            Haplosclerida; Spongillidae; Ephydatia.
REFERENCE   1  (residues 1 to 258)
  AUTHORS   Lavrov,D.V., Wang,X. and Kelly,M.
  TITLE     Reconstructing ordinal relationships in the Demospongiae using
            mitochondrial genomic data

This is the program I have written:

use strict;
use warnings;

open (PROTEIN,"<invertebrate.protein.gpff") or die $! ; 

my @prot=<PROTEIN>;

my $protlen=$#prot;
my $version="VERSION";
my $dbsource="DBSOURCE";
my $protname;
my $rna;

for(my $i=0;$i<=$protlen;$i++)
{
    if((substr($prot[$i],0,7)) eq $version)
    {
    # have to store the value of "YP_001648463.1" in $protname
    }
    if((substr($prot[$i],0,8)) eq $dbsource)
    {
    # have to store the value of "NC_010202.1" in $rna
    }
}

close PROTEIN;

The strings I need are named in the comments inside the if statements. Can someone tell me how to extract this data and store it in the respective scalars?


Re: Extract the matching strings by moritz (Cardinal) on Nov 11, 2010 at 22:17 UTC
You're nearly there. Instead of the comments, just do what the comments say:

    # have to store the value of "YP_001648463.1" in $protname
    $protname = "YP_001648463.1";

If that's not quite what you're looking for, take a look at split.

Perl 6 - second systems done right
Re: Extract the matching strings by oko1 (Deacon) on Nov 11, 2010 at 22:33 UTC
Perhaps you should consider writing a parser for this format - always a good idea if you're going to be dealing with it on a regular basis. Example:

    #!/usr/bin/perl -w
    use strict;

    open my $Gpff, '<', 'protein.gpff' or die "protein.gpff: $!\n";

    my ($key, %data);
    while (<$Gpff>){
        chomp;
        if (/^(?:\s\s)?([A-Z]+)\s+(.*)$/){
            $key = $1;
            $data{$key} = $2;
        }
        else {
            s/^\s+/ /;
            $data{$key} .= $_;
        }
    }
    close $Gpff;

Given the above, you can now easily extract the data that you want by its label; that is, printing $data{SOURCE} will output "mitochondrion Ephydatia muelleri". You may need to adjust the parser to suit your exact application, but this should give you a good start.

-- "Language shapes the way we think, and determines what we can think about." -- B. L. Whorf
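As a rough sketch of how the two requested values could be pulled out of a hash built this way: the snippet below reuses the same parsing loop on a few sample lines held in a string (an assumption made only so the example runs on its own) and then takes the first field of VERSION and the last field of DBSOURCE. The names $record, $fh, $protname and $rna are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample lines inlined so the sketch runs without the real file.
my $record = <<'END';
VERSION     YP_001648463.1  GI:164420795
DBLINK      Project: 28177
DBSOURCE    REFSEQ: accession NC_010202.1
END

open my $fh, '<', \$record or die "in-memory open failed: $!\n";

# Same label/value parsing loop as in the reply.
my ($key, %data);
while (<$fh>) {
    chomp;
    if (/^(?:\s\s)?([A-Z]+)\s+(.*)$/) {
        $key = $1;
        $data{$key} = $2;
    }
    else {
        s/^\s+/ /;
        $data{$key} .= $_;
    }
}
close $fh;

# VERSION holds "YP_001648463.1  GI:164420795": keep the first field.
my ($protname) = split ' ', $data{VERSION};

# DBSOURCE holds "REFSEQ: accession NC_010202.1": keep the last field.
my $rna = (split ' ', $data{DBSOURCE})[-1];

print "protname = $protname\n";
print "rna = $rna\n";
```

This assumes the wanted value is always the first field of VERSION and the last field of DBSOURCE, which holds for the sample record but has not been checked against the whole file.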
Re: Extract the matching strings by poulhs (Beadle) on Nov 11, 2010 at 23:18 UTC
Something like:

    LINE:
    while ( <PROTEIN> ) {
        if ( /^VERSION\s+(\S+)/ ) {
            # extracts the first non-space sequence after the VERSION-token
            $protname = $1;
            next LINE;
        }
        if ( /^DBSOURCE\s+.*\s(\S+)\s*$/ ) {
            # extracts the last non-space sequence on the DBSOURCE line
            $rna = $1;
            next LINE;
        }
    }

You first need to determine the syntax of the lines you want, and the location of the values you want to extract: Is the "YP_001648463.1" always the first field after the VERSION? Is the RNA always last on the DBSOURCE lines?

You should also consider what to do with invalid input: what happens if DBSOURCE is not present, or if $protname does not match the ACCESSION value?

For more info on regular expressions, check out `perldoc perlre`.
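The invalid-input questions could be handled along these lines. This is only a sketch: it assumes that a valid VERSION value is the ACCESSION value followed by a dot and a number (true for the sample record, not verified for the format in general), and the names $record, $accession and @problems are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample lines inlined so the sketch runs without the real file.
my $record = <<'END';
ACCESSION   YP_001648463
VERSION     YP_001648463.1  GI:164420795
DBSOURCE    REFSEQ: accession NC_010202.1
END

open my $fh, '<', \$record or die "in-memory open failed: $!\n";

my ($accession, $protname, $rna);
while (my $line = <$fh>) {
    if    ($line =~ /^ACCESSION\s+(\S+)/)        { $accession = $1 }
    elsif ($line =~ /^VERSION\s+(\S+)/)          { $protname  = $1 }
    elsif ($line =~ /^DBSOURCE\s+.*\s(\S+)\s*$/) { $rna       = $1 }
}
close $fh;

# Collect every problem instead of stopping at the first one.
my @problems;
push @problems, 'no VERSION line found'  unless defined $protname;
push @problems, 'no DBSOURCE line found' unless defined $rna;

# Assumption: VERSION should be ACCESSION plus ".<number>".
if (defined $accession && defined $protname
    && $protname !~ /^\Q$accession\E\.\d+$/) {
    push @problems, "VERSION $protname does not extend ACCESSION $accession";
}

if (@problems) {
    warn "$_\n" for @problems;
}
else {
    print "protname = $protname\n";
    print "rna = $rna\n";
}
```

Whether a problem should be a warning, a skipped record, or a fatal error depends on how the extracted values are used later.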
Re: Extract the matching strings by Marshall (Canon) on Nov 12, 2010 at 08:50 UTC
From what you describe, it doesn't appear that a general solution is necessary in terms of parsing this - you are only asking for two parameters. I would keep it simple and add more complex stuff when needed. You don't say whether there are multiple records like this in the file or whether this is some kind of "header per file". If there are multiple records like this in the file, then some modifications are needed. But basically, look for the keywords and, if found, use split to extract either the second thing on the line or the last thing on the line. All sorts of complex things are possible, but if you don't need them, then don't do it!

    #!/usr/bin/perl -w
    use strict;

    my $protname;
    my $rna;

    while (<DATA>)   #while (<PROTEIN>) in your case
    {
        if (/^VERSION/)
        {
            $protname = (split)[1];
        }
        elsif (/^DBSOURCE/)
        {
            $rna = (split)[-1];
        }
    }
    print "protname = $protname\n";
    print "rna = $rna\n";

    __END__
    PRINTS:
    protname = YP_001648463.1
    rna = NC_010202.1

    __DATA__
    LOCUS       YP_001648463             258 aa            linear   INV 17-JUN-2009
    DEFINITION  cytochrome c oxidase subunit II [Ephydatia muelleri].
    ACCESSION   YP_001648463
    VERSION     YP_001648463.1  GI:164420795
    DBLINK      Project: 28177
    DBSOURCE    REFSEQ: accession NC_010202.1
    KEYWORDS    .
    ....
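If the file does hold several records, a loop with two plain scalars keeps only the values from the last record. One possible modification, sketched under the assumption that every record starts with a LOCUS line: collect one small hash per record. The second record in the sample is a made-up placeholder, and the names @records and $current are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# First record taken from the question; the second one is invented
# purely to show that more than one record is collected.
my $input = <<'END';
LOCUS       YP_001648463             258 aa            linear   INV 17-JUN-2009
VERSION     YP_001648463.1  GI:164420795
DBSOURCE    REFSEQ: accession NC_010202.1
LOCUS       XX_000000001             100 aa            linear   INV 01-JAN-2000
VERSION     XX_000000001.1  GI:1
DBSOURCE    REFSEQ: accession NC_000000.1
END

open my $fh, '<', \$input or die "in-memory open failed: $!\n";

my @records;    # one hash reference per record
my $current;    # the record being filled right now

while (<$fh>) {
    if (/^LOCUS/) {
        # Assumption: LOCUS marks the start of a new record.
        $current = {};
        push @records, $current;
    }
    elsif (/^VERSION/ && $current) {
        $current->{protname} = (split)[1];     # second field on the line
    }
    elsif (/^DBSOURCE/ && $current) {
        $current->{rna} = (split)[-1];         # last field on the line
    }
}
close $fh;

for my $rec (@records) {
    my $protname = defined $rec->{protname} ? $rec->{protname} : 'unknown';
    my $rna      = defined $rec->{rna}      ? $rec->{rna}      : 'unknown';
    print "protname = $protname, rna = $rna\n";
}
```

Keeping the two values together in one hash per record avoids pairing a VERSION from one record with a DBSOURCE from another when a line is missing.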
Re: Extract the matching strings by aquarium (Curate) on Nov 12, 2010 at 04:16 UTC
I don't see an end-of-record marker, but you could use just the start-of-record marker (LOCUS?) to process the records in two passes. On the first pass you initialize and fill a temporary hash, with LOCUS, DEFINITION, etc. as keys and whatever remains on the line as the value for each key. When you hit a new record (or EOF), and before processing it, you turn each value (still just a string) of the temporary hash into an appropriately more elaborate structure as required. You then have the option of either processing the fully fledged data structure record by record, or growing it and processing it at the end according to whatever business rules apply.

This separates the logic for subfield processing (business rules) from the routine record processing on input. It does away with very convoluted code, that is, code that parses input and processes records at field and subfield levels all at once.

the hardest line to type correctly is: stty erase ^H
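A minimal sketch of that two-stage idea, assuming LOCUS starts each record and that the sequence length is the number in front of "aa" on the LOCUS line. The helper process_record and the field names protname, rna and length_aa are hypothetical, chosen only for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample lines from the question, inlined so the sketch runs on its own.
my $input = <<'END';
LOCUS       YP_001648463             258 aa            linear   INV 17-JUN-2009
ACCESSION   YP_001648463
VERSION     YP_001648463.1  GI:164420795
DBSOURCE    REFSEQ: accession NC_010202.1
  ORGANISM  Ephydatia muelleri
            Eukaryota; Metazoa; Porifera; Demospongiae; Ceractinomorpha;
            Haplosclerida; Spongillidae; Ephydatia.
END

# Stage 2: subfield processing, kept apart from the reading loop.
sub process_record {
    my ($raw) = @_;
    my %rec;
    ($rec{protname}) = split ' ', $raw->{VERSION} if defined $raw->{VERSION};
    $rec{rna} = (split ' ', $raw->{DBSOURCE})[-1] if defined $raw->{DBSOURCE};
    if (defined $raw->{LOCUS} && $raw->{LOCUS} =~ /(\d+)\s+aa\b/) {
        $rec{length_aa} = $1;
    }
    return \%rec;
}

open my $fh, '<', \$input or die "in-memory open failed: $!\n";

# Stage 1: collect label => raw string for the current record only.
my (@records, %raw, $key);
while (my $line = <$fh>) {
    chomp $line;
    if ($line =~ /^LOCUS\b/ && %raw) {
        # A new record starts: finish the previous one first.
        push @records, process_record({%raw});
        %raw = ();
    }
    if ($line =~ /^(?:\s\s)?([A-Z]+)\s+(.*)$/) {
        $key = $1;
        $raw{$key} = $2;
    }
    elsif (defined $key) {
        $line =~ s/^\s+/ /;     # continuation line: append to last label
        $raw{$key} .= $line;
    }
}
push @records, process_record({%raw}) if %raw;   # flush at EOF
close $fh;

for my $rec (@records) {
    print join(', ', map { "$_ = $rec->{$_}" } sort keys %$rec), "\n";
}
```

The reading loop never needs to change when new subfields are wanted; only process_record grows.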