sundeep has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am posting the source file content and the intermediate code I have written.

This is the source file content

LOCUS       YP_001648463             258 aa            linear   INV 17-JUN-2009
DEFINITION  cytochrome c oxidase subunit II [Ephydatia muelleri].
ACCESSION   YP_001648463
VERSION     YP_001648463.1  GI:164420795
DBLINK      Project: 28177
DBSOURCE    REFSEQ: accession NC_010202.1
KEYWORDS    .
SOURCE      mitochondrion Ephydatia muelleri
  ORGANISM  Ephydatia muelleri
            Eukaryota; Metazoa; Porifera; Demospongiae; Ceractinomorpha;
            Haplosclerida; Spongillidae; Ephydatia.
REFERENCE   1  (residues 1 to 258)
  AUTHORS   Lavrov,D.V., Wang,X. and Kelly,M.
  TITLE     Reconstructing ordinal relationships in the Demospongiae using
            mitochondrial genomic data

This is the program I have written:

use strict;
use warnings;

open (PROTEIN,"<invertebrate.protein.gpff") or die $! ;
my @prot=<PROTEIN>;
my $protlen=$#prot;
my $version="VERSION";
my $dbsource="DBSOURCE";
my $protname;
my $rna;
for(my $i=0;$i<=$protlen;$i++)
{
    if((substr($prot[$i],0,7)) eq $version)
    {
        # have to store the value of "YP_001648463.1" in $protname
    }
    if((substr($prot[$i],0,8)) eq $dbsource)
    {
        # have to store the value of "NC_010202.1" in $rna
    }
}
close PROTEIN;

The strings I need are mentioned in the comments inside the if statements. Can someone tell me how to extract this data and store it in the respective scalars?

Replies are listed 'Best First'.
Re: Extract the matching strings
by moritz (Cardinal) on Nov 11, 2010 at 22:17 UTC

    You're nearly there. Instead of the comments, just do what the comments say:

    # have to store the value of "YP_001648463.1" in $protname
    $protname = "YP_001648463.1";

    If that's not quite what you're looking for, take a look at split.
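
    As a minimal sketch of what split could look like here: the two sample lines below are copied from the record in the question, and the choice of "second field" and "last field" is an assumption about where the wanted values sit on those lines.

```perl
use strict;
use warnings;

# Sample lines copied from the record in the question.
my $version_line  = "VERSION     YP_001648463.1  GI:164420795";
my $dbsource_line = "DBSOURCE    REFSEQ: accession NC_010202.1";

# split ' ' splits on runs of whitespace and skips leading whitespace.
my @version_fields  = split ' ', $version_line;
my @dbsource_fields = split ' ', $dbsource_line;

# Assumption: the wanted values are the second field of the VERSION line
# and the last field of the DBSOURCE line.
my $protname = $version_fields[1];
my $rna      = $dbsource_fields[-1];

print "protname = $protname\n";
print "rna = $rna\n";
```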

Re: Extract the matching strings
by oko1 (Deacon) on Nov 11, 2010 at 22:33 UTC

    Perhaps you should consider writing a parser for this format - always a good idea if you're going to be dealing with it on a regular basis. Example:

    #!/usr/bin/perl -w
    use strict;

    open my $Gpff, '<', 'protein.gpff' or die "protein.gpff: $!\n";

    my ($key, %data);
    while (<$Gpff>){
        chomp;
        if (/^(?:\s\s)?([A-Z]+)\s+(.*)$/){
            $key = $1;
            $data{$key} = $2;
        }
        else {
            s/^\s+/ /;
            $data{$key} .= $_;
        }
    }
    close $Gpff;

    Given the above, you can now easily extract the data that you want by its label; that is, printing $data{SOURCE} will output "mitochondrion Ephydatia muelleri". You may need to adjust the parser to suit your exact application, but this should give you a good start.
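
    To get from the hash to the two values asked for, something along these lines might work. It is only a sketch: it runs the same parsing loop over a few lines of the record held in a string (an in-memory filehandle stands in for the real file), adds a guard for continuation lines that arrive before any key, and assumes the wanted values are the first token of VERSION and the last token of DBSOURCE.

```perl
use strict;
use warnings;

# A few lines of the record from the question, kept in a string so the
# sketch runs without the real file.
my $record = <<'END_RECORD';
ACCESSION   YP_001648463
VERSION     YP_001648463.1  GI:164420795
DBSOURCE    REFSEQ: accession NC_010202.1
SOURCE      mitochondrion Ephydatia muelleri
  ORGANISM  Ephydatia muelleri
            Eukaryota; Metazoa; Porifera; Demospongiae; Ceractinomorpha;
            Haplosclerida; Spongillidae; Ephydatia.
END_RECORD

open my $fh, '<', \$record or die "in-memory record: $!\n";

my ($key, %data);
while (<$fh>) {
    chomp;
    if (/^(?:\s\s)?([A-Z]+)\s+(.*)$/) {
        # A keyword line: start a new entry.
        $key = $1;
        $data{$key} = $2;
    }
    elsif (defined $key) {
        # A continuation line: append it to the current entry.
        s/^\s+/ /;
        $data{$key} .= $_;
    }
}
close $fh;

# Assumption: first token of VERSION, last token of DBSOURCE.
my ($protname) = $data{VERSION}  =~ /^(\S+)/;
my ($rna)      = $data{DBSOURCE} =~ /(\S+)\s*$/;

print "protname = $protname\n";
print "rna = $rna\n";
```

    Note that a hash keeps only one value per keyword, so a keyword that occurs more than once in a record (REFERENCE, for instance) would overwrite its earlier value.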


    --
    "Language shapes the way we think, and determines what we can think about."
    -- B. L. Whorf
Re: Extract the matching strings
by poulhs (Beadle) on Nov 11, 2010 at 23:18 UTC

    something like:

    LINE:
    while ( <PROTEIN> ) {
        if ( /^VERSION\s+(\S+)/ ) {
            # extracts the first non-space sequence after the VERSION-token
            $protname = $1;
            next LINE;
        }
        if ( /^DBSOURCE\s+.*\s(\S+)\s*$/ ) {
            # extracts the last non-space sequence on the DBSOURCE line
            $rna = $1;
            next LINE;
        }
    }

    You first need to determine the syntax of the lines you want, and the location of the values you want to extract:
    Is the "YP_001648463.1" always the first field after the VERSION? Is the RNA always last on the DBSOURCE lines?
    You should consider what to do with invalid input: what happens if DBSOURCE is not present or $protname does not match the ACCESSION-value...
    For more info on regular expressions, check out perldoc perlre.
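
    The invalid-input checks mentioned above could be sketched roughly as follows. The @lines array stands in for the file, and the rule that a VERSION value is the ACCESSION value followed by a dot and a number is an assumption based on the sample record; adjust it if your data differs.

```perl
use strict;
use warnings;

# Stand-in for the lines read from the file.
my @lines = (
    "ACCESSION   YP_001648463\n",
    "VERSION     YP_001648463.1  GI:164420795\n",
    "DBSOURCE    REFSEQ: accession NC_010202.1\n",
);

my ($accession, $protname, $rna);
for (@lines) {
    if    (/^ACCESSION\s+(\S+)/)        { $accession = $1 }
    elsif (/^VERSION\s+(\S+)/)          { $protname  = $1 }
    elsif (/^DBSOURCE\s+.*\s(\S+)\s*$/) { $rna       = $1 }
}

# Collect every problem instead of dying on the first one.
my @problems;
push @problems, "no VERSION line found"  unless defined $protname;
push @problems, "no DBSOURCE line found" unless defined $rna;
if (defined $protname && defined $accession) {
    # Assumption: VERSION looks like "<ACCESSION>.<number>".
    push @problems, "VERSION $protname does not match ACCESSION $accession"
        unless $protname =~ /^\Q$accession\E\.\d+$/;
}

if (@problems) {
    warn "$_\n" for @problems;
}
else {
    print "protname = $protname\n";
    print "rna = $rna\n";
}
```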

Re: Extract the matching strings
by Marshall (Canon) on Nov 12, 2010 at 08:50 UTC
    From what you describe, a general parsing solution doesn't appear to be necessary: you are only asking for two values. I would keep it simple and add more complex stuff when needed. You don't say whether there are multiple records like this in the file or whether this is some kind of per-file header. If there are multiple records like this in the file, then some modifications are needed.

    But basically look for the keywords and if found, use split to extract either the 2nd thing on the line or the last thing on the line. All sorts of complex things are possible, but if you don't need them, then don't do it!

    #!/usr/bin/perl -w
    use strict;

    my $protname;
    my $rna;

    while (<DATA>)   #while (<PROTEIN>) in your case
    {
        if (/^VERSION/)
        {
            $protname = (split)[1];
        }
        elsif (/^DBSOURCE/)
        {
            $rna = (split)[-1];
        }
    }
    print "protname = $protname\n";
    print "rna = $rna\n";

    __END__
    PRINTS:
    protname = YP_001648463.1
    rna = NC_010202.1

    __DATA__
    LOCUS       YP_001648463             258 aa            linear   INV 17-JUN-2009
    DEFINITION  cytochrome c oxidase subunit II [Ephydatia muelleri].
    ACCESSION   YP_001648463
    VERSION     YP_001648463.1  GI:164420795
    DBLINK      Project: 28177
    DBSOURCE    REFSEQ: accession NC_010202.1
    KEYWORDS    .
    ....

Re: Extract the matching strings
by aquarium (Curate) on Nov 12, 2010 at 04:16 UTC
    I don't see an end-of-record marker, but you could use just the start-of-record marker (LOCUS?) to process the records in two passes. On the first pass you initialize and fill a temporary hash, with LOCUS, DEFINITION, etc. as keys and whatever remains on the line as the value for each key. When you hit a new record (or EOF), and before processing it, turn each value of the temporary hash (still just a string) into a more elaborate structure as required. You then have the option of either processing the full data structure record by record, or growing it and processing it at the end according to whatever business rules apply. This separates the logic for subfield processing (business rules) from the routine record processing on input. It also does away with very convoluted code that parses input and processes records at field and subfield level all at once.
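
    A rough sketch of that idea, assuming LOCUS really does start every record. The input is two cut-down records held in a string; the second record is made up purely for illustration, and process_record is a hypothetical helper name.

```perl
use strict;
use warnings;

# Two cut-down records. The second one is invented for illustration only.
my $input = <<'END_INPUT';
LOCUS       YP_001648463             258 aa            linear   INV 17-JUN-2009
VERSION     YP_001648463.1  GI:164420795
DBSOURCE    REFSEQ: accession NC_010202.1
LOCUS       XX_000000001             100 aa            linear   INV 01-JAN-2000
VERSION     XX_000000001.1  GI:1
DBSOURCE    REFSEQ: accession NC_000000.1
END_INPUT

my @results;

# Second stage: turn the raw strings of one record into the values we want.
sub process_record {
    my ($raw) = @_;
    return unless %$raw;    # nothing collected yet
    my ($protname, $rna);
    ($protname) = $raw->{VERSION}  =~ /^(\S+)/     if defined $raw->{VERSION};
    ($rna)      = $raw->{DBSOURCE} =~ /(\S+)\s*$/  if defined $raw->{DBSOURCE};
    push @results, { protname => $protname, rna => $rna };
}

open my $fh, '<', \$input or die "in-memory input: $!\n";

# First stage: collect keyword => rest-of-line for the current record.
my (%raw, $key);
while (my $line = <$fh>) {
    chomp $line;
    if (my ($k, $v) = $line =~ /^(?:\s\s)?([A-Z]+)\s+(.*)$/) {
        if ($k eq 'LOCUS') {
            # A new record starts: finish the previous one first.
            process_record(\%raw);
            %raw = ();
        }
        $key = $k;
        $raw{$key} = $v;
    }
    elsif (defined $key) {
        # Continuation line: append it to the current keyword's value.
        $line =~ s/^\s+/ /;
        $raw{$key} .= $line;
    }
}
process_record(\%raw);    # the last record ends at EOF
close $fh;

print "$_->{protname} => $_->{rna}\n" for @results;
```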
    the hardest line to type correctly is: stty erase ^H