AWallBuilder has asked for the wisdom of the Perl Monks concerning the following question:

Dear all, This is related to an earlier post. I have made progress, but I am stuck again. I am parsing a file (partial example below) and reading each record into a hash %values. However, I want to extract info from the ECOGENE line (e.g. DBLINKS - (ECOGENE ...)). I think my %values hash will not have separate key/value pairs for the different DBLINKS lines, since they all share the same key. Is there any way I can extract just the one for ECOGENE? I tried one way in my code, but it is wrong, as I realize the problem is in reading into the hash. Any help appreciated.

portion of input file

//
UNIQUE-ID - EG11751
TYPES - BC-5.5.2
TYPES - BC-1.7.9
TYPES - BC-5.5.1
COMMON-NAME - otsA
ACCESSION-1 - b1896
ACCESSION-2 - ECK1895
CENTISOME-POSITION - 42.636864
COMMENT-INTERNAL - 1/24/05 keseler removed pexA as synonym
COMPONENT-OF - COLI-K12-39
COMPONENT-OF - TU0-7722
COMPONENT-OF - TU00391
COMPONENT-OF - TU00312
DBLINKS - (ECOLIHUB "otsA" NIL |kr| 3474243543 NIL NIL)
DBLINKS - (REGULONDB "EG11751" NIL |kr| 3462030648 NIL NIL)
DBLINKS - (ASAP "ABE-0006318" NIL |paley| 3398447608 NIL NIL)
DBLINKS - (ECHOBASE "EB1701" NIL |pkarp| 3346767936 NIL NIL)
DBLINKS - (ECOGENE "EG11751" NIL |pick| 3292798423 NIL NIL)
DBLINKS - (OU-MICROARRAY "b1896" NIL NIL NIL NIL NIL)
DBLINKS - (CGSC "18073" NIL |pkarp| 3035559680 NIL NIL)
KNOCKOUT-GROWTH-OBSERVATIONS - OBS0-40
KNOCKOUT-GROWTH-OBSERVATIONS - OBS0-37
KNOCKOUT-GROWTH-OBSERVATIONS - OBS0-33
KNOCKOUT-GROWTH-OBSERVATIONS - OBS0-49
KNOCKOUT-GROWTH-OBSERVATIONS - OBS0-44
LAST-UPDATE - 3609256889
LEFT-END-POSITION - 1978212
MEMBER-SORT-FN - NUMBERED-CLASS-SORT-FN
PRODUCT - TREHALOSE6PSYN-MONOMER
RIGHT-END-POSITION - 1979636
TRANSCRIPTION-DIRECTION - -
//

code

use strict;
use warnings;
use Data::Dumper;
##
my $inGeneDat=$ARGV[0] || "genes.dat";
open(IN,"<",$inGeneDat) || die "cannot open $inGeneDat\n";
##
my %HoNms;
{
    local $/ = '//';
    while(my $record=<IN>) {
        my %values = $record =~ /^(\S+)\s+-\s+(\S+)/mg;
        next unless exists $values{'UNIQUE-ID'} and exists $values{'ACCESSION-1'};
        # Your code using $values{'UNIQUE-ID'} and other values here
        my $cycID=$values{'UNIQUE-ID'};
        my $cycLoc=$values{'ACCESSION-1'};
        my $ECKLoc=$values{'ACCESSION-2'};
        my $EGLocL=$values{'DBLINKS'};
        $EGLocL=~/"(EG\S+)"/;
        my $EGLoc=$1;
        my $Nm=$values{'COMMON-NAME'};
        #
        $HoNms{$cycID} = {
            'acc1' => $cycLoc,
            'acc2' => $ECKLoc,
            'EG'=> $EGLoc,
            'nm' => $Nm
        };
    }
}
print Dumper(%HoNms);
close(IN);

Replies are listed 'Best First'.
Re: file parsing
by RichardK (Parson) on Jan 19, 2015 at 14:09 UTC

    You're going to have to store your data in a more complex data structure. Have a read of the Perl Data Structures Cookbook (perldsc) to give you some ideas.

    The last time you asked this, hdb gave you a good starting point for reading the file line by line in 1113317. You'll just have to deal with the duplicate entries. There are a number of ways to do it, but choosing one will depend on exactly what you need to do.
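As one possible illustration of the "more complex data structure" idea, here is a minimal sketch of a hash of arrays, assuming each input line has the form KEY - VALUE. The sample lines are copied from the record in the question; the variable names are only for illustration. Every key maps to a list, so the repeated DBLINKS lines are all kept and the ECOGENE one can be picked out afterwards.

```perl
use strict;
use warnings;

# Sample lines taken from the record in the question.
my @lines = (
    'UNIQUE-ID - EG11751',
    'DBLINKS - (REGULONDB "EG11751" NIL |kr| 3462030648 NIL NIL)',
    'DBLINKS - (ECOGENE "EG11751" NIL |pick| 3292798423 NIL NIL)',
    'DBLINKS - (CGSC "18073" NIL |pkarp| 3035559680 NIL NIL)',
);

# Hash of arrays: every key maps to a list, so repeated keys are kept.
my %values;
for my $line (@lines) {
    next unless $line =~ /^(\S+) - (.*)$/;
    push @{ $values{$1} }, $2;
}

# Pick the ECOGENE link out of all stored DBLINKS values.
my ($ecogene) = map { /^\(ECOGENE "([^"]+)"/ ? $1 : () } @{ $values{'DBLINKS'} };
print "ECOGENE id: $ecogene\n";
```

Anchoring the pattern on the database name matters here, because other links such as REGULONDB carry an ID that also starts with EG.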

Re: file parsing
by poj (Abbot) on Jan 19, 2015 at 14:20 UTC
    #!perl
    use strict;
    use warnings;
    use Data::Dump 'pp';
    ##
    my $inGeneDat = $ARGV[0] || "genes.dat";
    open IN,'<',$inGeneDat or die "cannot open $inGeneDat\n";
    {
      local $/ = '//';
      while ( my $record = <IN> ){
        my %values=();
        my @lines = split "\n",$record;
        for (@lines){
          if (/^(UNIQUE-ID|ACCESSION-\d|DBLINKS|COMMON-NAME) - (.*)/){
            my ($key,$val) = ($1,$2);
            # print "$key $val\n";
            if ($key eq 'DBLINKS'){
              if ($val =~ /ECOGENE \"(EG\d+)\"/){
                $values{$key} = $1;
              }
            } else {
              $values{$key} = $val;
            }
          }
        };
        pp \%values;
        # build %HoNm as required
      }
    }
    close(IN);
    poj
      Thanks - this is along the lines of what I was thinking. I moved the initialization of %values to before the while loop, though.
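One thing to watch when the declaration moves out of the loop: unless the hash is emptied for each record, keys from an earlier record stay in place when a later record lacks them. The sketch below uses two made-up records (the second ID, EG99999, is hypothetical and has no COMMON-NAME line) to show the carry-over.

```perl
use strict;
use warnings;

# Two hypothetical records; the second has no COMMON-NAME line.
my @records = (
    "UNIQUE-ID - EG11751\nCOMMON-NAME - otsA\n",
    "UNIQUE-ID - EG99999\n",
);

my %values;    # declared once, outside the loop
my @seen;
for my $record (@records) {
    # %values = ();   # uncomment to reset the hash for each record
    for my $line (split /\n/, $record) {
        $values{$1} = $2 if $line =~ /^(\S+) - (.*)$/;
    }
    my $name = defined $values{'COMMON-NAME'} ? $values{'COMMON-NAME'} : '(none)';
    push @seen, "$values{'UNIQUE-ID'}: $name";
}
print "$_\n" for @seen;
# Without the reset, the second line reports otsA for EG99999 as well.
```

Keeping the declaration inside the loop, as in poj's version, or resetting the hash at the top of each iteration avoids the stale value.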
Re: file parsing
by Anonymous Monk on Jan 19, 2015 at 15:32 UTC

    Perhaps it'd be better to invest into a somewhat more generic parser, something like this:

    my (@records,$cur);
    while(<>) {
        chomp;
        if ($_ eq "//") {
            push @records, $cur if defined $cur;
            $cur = undef;
        }
        elsif (/^(.+?) - (.+)$/) {
            my ($key,$value) = ($1,$2);
            if (defined $cur->{$key}) {
                if (ref $cur->{$key})
                    { push @{$cur->{$key}}, $value }
                else
                    { $cur->{$key} = [$cur->{$key}, $value] }
            }
            else
                { $cur->{$key} = $value }
        }
        else
            { warn "didn't handle input line: $_" }
    }
    push @records, $cur if defined $cur;

    Note that changing @records into a hash keyed by UNIQUE-ID is as simple as my %records = map {$_->{'UNIQUE-ID'}=>$_} @records;
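With records in this shape, a field seen once holds a plain string and a field seen several times holds an array reference, so code that reads DBLINKS has to cope with both. The following is a rough sketch of one way to pull out the ECOGENE ID; the helpers values_of and dblink_id are hypothetical names, not part of the parser, and the sample record is a hand-written subset of the one in the question.

```perl
use strict;
use warnings;

# A record shaped like the parser's output: a key seen once holds a string,
# a key seen several times holds an array reference.
my $record = {
    'UNIQUE-ID' => 'EG11751',
    'DBLINKS'   => [
        '(ECHOBASE "EB1701" NIL |pkarp| 3346767936 NIL NIL)',
        '(ECOGENE "EG11751" NIL |pick| 3292798423 NIL NIL)',
    ],
};

# Hypothetical helper: always return a field's values as a list.
sub values_of {
    my ($rec, $key) = @_;
    my $v = $rec->{$key};
    return () unless defined $v;
    return ref($v) ? @$v : ($v);
}

# Hypothetical helper: ID of the DBLINKS entry for the named database.
sub dblink_id {
    my ($rec, $db) = @_;
    for my $link (values_of($rec, 'DBLINKS')) {
        return $1 if $link =~ /^\(\Q$db\E "([^"]+)"/;
    }
    return;    # no link for that database
}

print dblink_id($record, 'ECOGENE'), "\n";
```

Normalising to a list in one place keeps the scalar-or-reference check out of the rest of the code.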

    Output of the parser above for your example input: