file parsing

AWallBuilder has asked for the wisdom of the Perl Monks concerning the following question:

Dear all, this is related to an earlier post. I have made progress, but I am stuck again. I am parsing a file (partial example below) and reading it into a hash %values. However, I want to extract info from the ECOGENE line (e.g. DBLINKS - (ECOGENE ...)). I think my %values hash will not have separate key/value pairs for the multiple DBLINKS lines. Is there any way I can extract just the one for ECOGENE? I tried one way in my code, but it is wrong, as I realize the problem is in reading into the hash. Any help appreciated.

portion of input file

//
UNIQUE-ID - EG11751
TYPES - BC-5.5.2
TYPES - BC-1.7.9
TYPES - BC-5.5.1
COMMON-NAME - otsA
ACCESSION-1 - b1896
ACCESSION-2 - ECK1895
CENTISOME-POSITION - 42.636864    
COMMENT-INTERNAL - 1/24/05 keseler removed pexA as synonym
COMPONENT-OF - COLI-K12-39
COMPONENT-OF - TU0-7722
COMPONENT-OF - TU00391
COMPONENT-OF - TU00312
DBLINKS - (ECOLIHUB "otsA" NIL |kr| 3474243543 NIL NIL)
DBLINKS - (REGULONDB "EG11751" NIL |kr| 3462030648 NIL NIL)
DBLINKS - (ASAP "ABE-0006318" NIL |paley| 3398447608 NIL NIL)
DBLINKS - (ECHOBASE "EB1701" NIL |pkarp| 3346767936 NIL NIL)
DBLINKS - (ECOGENE "EG11751" NIL |pick| 3292798423 NIL NIL)
DBLINKS - (OU-MICROARRAY "b1896" NIL NIL NIL NIL NIL)
DBLINKS - (CGSC "18073" NIL |pkarp| 3035559680 NIL NIL)
KNOCKOUT-GROWTH-OBSERVATIONS - OBS0-40
KNOCKOUT-GROWTH-OBSERVATIONS - OBS0-37
KNOCKOUT-GROWTH-OBSERVATIONS - OBS0-33
KNOCKOUT-GROWTH-OBSERVATIONS - OBS0-49
KNOCKOUT-GROWTH-OBSERVATIONS - OBS0-44
LAST-UPDATE - 3609256889
LEFT-END-POSITION - 1978212
MEMBER-SORT-FN - NUMBERED-CLASS-SORT-FN
PRODUCT - TREHALOSE6PSYN-MONOMER
RIGHT-END-POSITION - 1979636
TRANSCRIPTION-DIRECTION - -
//

code

use strict;
use warnings;
use Data::Dumper;
##
my $inGeneDat=$ARGV[0] || "genes.dat";
open(IN,"<",$inGeneDat) || die "cannot open $inGeneDat\n";

##
my %HoNms;
{
  local $/ = '//';
  while(my $record=<IN>)
  {
    my %values = $record =~ /^(\S+)\s+-\s+(\S+)/mg;
    next unless exists $values{'UNIQUE-ID'} and exists $values{'ACCESSION-1'};
    # Your code using $values{'UNIQUE-ID'} and other values here
        my $cycID=$values{'UNIQUE-ID'};
        my $cycLoc=$values{'ACCESSION-1'};
        my $ECKLoc=$values{'ACCESSION-2'};
        my $EGLocL=$values{'DBLINKS'};
        $EGLocL=~/"(EG\S+)"/;
        my $EGLoc=$1;
        my $Nm=$values{'COMMON-NAME'};
        #
        $HoNms{$cycID} = { 'acc1' => $cycLoc,
                 'acc2' => $ECKLoc,
                'EG'=> $EGLoc,
                'nm' => $Nm
               };
  }
}

print Dumper(\%HoNms);
close(IN);


Re: file parsing by RichardK (Parson) on Jan 19, 2015 at 14:09 UTC
You're going to have to store your data in a more complex data structure; have a read of the Perl Data Structures Cookbook to give you some ideas. The last time you asked this, hdb gave you a good starting point for reading the file line by line in 1113317. You'll just have to deal with the duplicate entries. There are a number of ways to do it, but choosing one will depend on exactly what you need to do.
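One common way to deal with the duplicate entries is a hash of arrays, where each key maps to a list of every value seen for it. The following is only a minimal sketch under that assumption, not the code from the node referenced above; the shortened sample text and the names `%flat` and `%multi` are made up for illustration.

```perl
use strict;
use warnings;

# Sample record text with a repeated key (shortened from the posted input).
my $record = join "\n",
    'UNIQUE-ID - EG11751',
    'DBLINKS - (ECHOBASE "EB1701" NIL |pkarp| 3346767936 NIL NIL)',
    'DBLINKS - (ECOGENE "EG11751" NIL |pick| 3292798423 NIL NIL)',
    'DBLINKS - (CGSC "18073" NIL |pkarp| 3035559680 NIL NIL)';

# List assignment to a plain hash: later pairs overwrite earlier ones,
# so only the last DBLINKS value survives.
my %flat = $record =~ /^(\S+)\s+-\s+(\S+)/mg;
print "flat: $flat{DBLINKS}\n";

# Hash of arrays: push every value onto the list for its key.
my %multi;
while ($record =~ /^(\S+) - (.*)$/mg) {
    push @{ $multi{$1} }, $2;
}

# Pick the quoted ID out of whichever DBLINKS entry names ECOGENE.
my ($ecogene) = map { /^\(ECOGENE "([^"]+)"/ ? $1 : () } @{ $multi{DBLINKS} };
print "ECOGENE: $ecogene\n";
```

The first `print` shows why the original one-line hash assignment loses the ECOGENE line: the hash can only hold one value per key, and the last `DBLINKS` line wins.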
Re: file parsing by poj (Abbot) on Jan 19, 2015 at 14:20 UTC
#!perl
use strict;
use warnings;
use Data::Dump 'pp';
##
my $inGeneDat = $ARGV[0] || "genes.dat";
open IN,'<',$inGeneDat
  or die "cannot open $inGeneDat\n";

{
  local $/ = '//';
  while ( my $record = <IN> ){
    my %values=();
    my @lines = split "\n",$record;
    for (@lines){
      if (/^(UNIQUE-ID|ACCESSION-\d|DBLINKS|COMMON-NAME) - (.*)/){
        my ($key,$val) = ($1,$2);
        # print "$key $val\n";
        if ($key eq 'DBLINKS'){
          if ($val =~ /ECOGENE \"(EG\d+)\"/){
            $values{$key} = $1;
          }
        } else {
          $values{$key} = $val;
        }
      }
    };
    pp \%values;
    # build %HoNm as required
  }
}
close(IN);

poj
Re^2: file parsing by AWallBuilder (Beadle) on Jan 19, 2015 at 15:02 UTC
Thanks - this is along the lines of what I was thinking. I moved the initialization of %values to before the while loop though.
Re: file parsing by Anonymous Monk on Jan 19, 2015 at 15:32 UTC
Perhaps it'd be better to invest in a somewhat more generic parser, something like this:

my (@records,$cur);
while(<>) {
    chomp;
    if ($_ eq "//") {
        push @records, $cur if defined $cur;
        $cur = undef;
    }
    elsif (/^(.+?) - (.+)$/) {
        my ($key,$value) = ($1,$2);
        if (defined $cur->{$key}) {
            if (ref $cur->{$key})
                { push @{$cur->{$key}}, $value }
            else
                { $cur->{$key} = [$cur->{$key}, $value] }
        }
        else { $cur->{$key} = $value }
    }
    else { warn "didn't handle input line: $_" }
}
push @records, $cur if defined $cur;

Note that changing `@records` into a hash keyed by `UNIQUE-ID` is as simple as `my %records = map {$_->{'UNIQUE-ID'}=>$_} @records;`

Output of the above code for your example input: Read more... (3 kB)
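With the generic parser above, a repeated key such as `DBLINKS` ends up as an array reference, while a key seen once stays a plain string, so code that reads a record has to cope with both shapes. A minimal sketch of pulling out the ECOGENE ID from such a record follows; the hand-built `$rec` and the helper name `values_for` are hypothetical and not part of the reply.

```perl
use strict;
use warnings;

# Hypothetical record in the shape the parser above produces:
# single-valued keys are plain strings, repeated keys are array references.
my $rec = {
    'UNIQUE-ID' => 'EG11751',
    'DBLINKS'   => [
        '(ECHOBASE "EB1701" NIL |pkarp| 3346767936 NIL NIL)',
        '(ECOGENE "EG11751" NIL |pick| 3292798423 NIL NIL)',
    ],
};

# values_for (hypothetical helper): always return a list for a key,
# whether the stored value is missing, a scalar, or an array reference.
sub values_for {
    my ($record, $key) = @_;
    my $v = $record->{$key};
    return () unless defined $v;
    return ref($v) eq 'ARRAY' ? @$v : ($v);
}

# Take the quoted ID from the first DBLINKS entry whose database is ECOGENE.
my ($ecogene) = map { /^\(ECOGENE\s+"([^"]+)"/ ? $1 : () }
                values_for($rec, 'DBLINKS');

print "$rec->{'UNIQUE-ID'} => ", (defined $ecogene ? $ecogene : 'none'), "\n";
```

Normalizing to a list in one place keeps the scalar-or-arrayref check out of the rest of the code, which matters for records that happen to have only a single `DBLINKS` line.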