Parsing line by line

AWallBuilder has asked for the wisdom of the Perl Monks concerning the following question:

Dear all, I am familiar with how to read in a table, parse the columns, and create hashes holding the data. However, I am now dealing with input files where related variables sit on successive rows rather than in columns of the same row, and my usual parsing logic no longer works. The error might have to do with where I initialize my variables, or it might be something else. Any help is appreciated.

sample input data

#
UNIQUE-ID - GJDZ-5046
TYPES - BC-4
TYPES - Unclassified-Genes
COMMON-NAME - STM14_5042
ACCESSION-1 - STM14_5042
CENTISOME-POSITION - 90.96536    
COMPONENT-OF - CHROMOSOME-1-100
COMPONENT-OF - TUJDZ-2494
COMPONENT-OF - CHROMOSOME-1
LEFT-END-POSITION - 4430254
PRODUCT - GJDZ-5046-MONOMER
RIGHT-END-POSITION - 4430427
TRANSCRIPTION-DIRECTION - -
//
UNIQUE-ID - GJDZ-1101
TYPES - BC-4
TYPES - Unclassified-Genes
COMMON-NAME - focA
ACCESSION-1 - STM14_1100
CENTISOME-POSITION - 20.85712    
COMPONENT-OF - CHROMOSOME-1-23
COMPONENT-OF - TUJDZ-587
COMPONENT-OF - CHROMOSOME-1
LEFT-END-POSITION - 1015797
PRODUCT - GJDZ-1101-MONOMER
RIGHT-END-POSITION - 1016774
TRANSCRIPTION-DIRECTION - -
//

snippet of code that gives wrong results

my %Ho14Loc2GeNm;
while(my $lines=<IN>){
        my $cycID14; my $cycLoc14;my $cycNm14;
        next unless (($lines =~/^UNIQUE-ID/) || ($lines=~/^ACCESSION-1/)|| ($lines=~/^COMMON-NAME/));
        chomp $lines;
        if ($lines =~ /^UNIQUE-ID/){
                $lines=~/(GJDZ-[0-9]+)/;
                $cycID14=$1;
        }
        if ($lines =~ /^COMMON-NAME/){
                $lines=~/COMMON-NAME - (\S+)/;
                $cycNm14=$1;
        }
        if ($lines =~ /^ACCESSION-1/){
                $lines=~/(STM14_[0-9]+)/;
                $cycLoc14=$1;
        }
        if (defined($cycLoc14)){
                $Ho14Loc2GeNm{$cycLoc14}=$cycNm14;
        }
}
print Dumper(%Ho14Loc2GeNm);
close(IN);


Replies are listed 'Best First'.
Re: Parsing line by line by Eily (Monsignor) on Jan 14, 2015 at 14:02 UTC
Since your records are separated by //, you can change the value of $/, the input record separator, so that each iteration reads a whole record instead of a single line. With the /m modifier, ^ matches at the beginning of any line in the string. Your input file seems to have a well-defined format, "IDENTIFIER - value", so I think the easiest way to read it is to get all values into a hash and then fetch the ones you need:

{
    local $/ = '//';
    while (my $record = <IN>) {
        my %values = $record =~ /^([-A-Z0-9]+)\s+-\s+(.*)/mg;
        next unless exists $values{'UNIQUE-ID'} and exists $values{'ACCESSION-1'};
        # Your code using $values{'UNIQUE-ID'} and other values here
    }
}
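The record-at-a-time idea can be sketched end to end as follows. This is only an illustration under a few assumptions: the data is read from an in-memory string ($sample, a shortened copy of the sample input) rather than from the IN filehandle, the key pattern allows digits so that ACCESSION-1 is captured, and trailing spaces are trimmed from values.

```perl
use strict;
use warnings;

# Shortened copy of the sample input (assumption: same layout as the real file).
my $sample = <<'END';
#
UNIQUE-ID - GJDZ-5046
TYPES - BC-4
COMMON-NAME - STM14_5042
ACCESSION-1 - STM14_5042
TRANSCRIPTION-DIRECTION - -
//
UNIQUE-ID - GJDZ-1101
TYPES - BC-4
COMMON-NAME - focA
ACCESSION-1 - STM14_1100
TRANSCRIPTION-DIRECTION - -
//
END

# In-memory filehandle so the sketch runs without an input file.
open my $in, '<', \$sample or die "Cannot open in-memory sample: $!";

my %Ho14Loc2GeNm;
{
    local $/ = '//';    # read one whole record per iteration
    while (my $record = <$in>) {
        # Collect every "KEY - value" line of the record into a hash.
        # The key class includes digits so that ACCESSION-1 matches.
        my %values = $record =~ /^([-A-Z0-9]+) - (.*?)[ \t]*$/mg;
        next unless exists $values{'ACCESSION-1'}
                and exists $values{'COMMON-NAME'};
        $Ho14Loc2GeNm{ $values{'ACCESSION-1'} } = $values{'COMMON-NAME'};
    }
}
close $in;

print "$_ => $Ho14Loc2GeNm{$_}\n" for sort keys %Ho14Loc2GeNm;
```

Because each record becomes an ordinary hash, repeated keys such as TYPES keep only their last value here, which is harmless for a locus-to-name lookup.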
Re^2: Parsing line by line by AWallBuilder (Beadle) on Jan 14, 2015 at 14:53 UTC
thanks, I need to look into this input record separator.
Re: Parsing line by line by choroba (Cardinal) on Jan 14, 2015 at 13:57 UTC
Two issues:

1. You shouldn't declare the lexical variables inside the loop. That creates new variables on each iteration, i.e. for each input line, so the values are not preserved from one line to the next. Move the declaration before the `while`.
2. After adding a new record to the hash, clear the variables. It's enough to add the following line after the assignment: `undef $cycLoc14;`

لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
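Applied to the original loop, the two changes might look like the sketch below. It is a minimal illustration, not the poster's final code: it reads from an in-memory string ($sample, a shortened copy of the sample input) instead of IN, folds each test and capture into one match, and resets all three variables rather than only $cycLoc14.

```perl
use strict;
use warnings;

# Shortened copy of the sample input (assumption: same layout as the real file).
my $sample = <<'END';
#
UNIQUE-ID - GJDZ-5046
COMMON-NAME - STM14_5042
ACCESSION-1 - STM14_5042
//
UNIQUE-ID - GJDZ-1101
COMMON-NAME - focA
ACCESSION-1 - STM14_1100
//
END

# In-memory filehandle so the sketch runs without an input file.
open my $in, '<', \$sample or die "Cannot open in-memory sample: $!";

my %Ho14Loc2GeNm;
# Declared once, before the loop, so values survive from line to line.
my ($cycID14, $cycLoc14, $cycNm14);

while (my $line = <$in>) {
    chomp $line;
    if    ($line =~ /^UNIQUE-ID - (GJDZ-[0-9]+)/)    { $cycID14  = $1 }
    elsif ($line =~ /^COMMON-NAME - (\S+)/)          { $cycNm14  = $1 }
    elsif ($line =~ /^ACCESSION-1 - (STM14_[0-9]+)/) { $cycLoc14 = $1 }

    # In the sample, COMMON-NAME precedes ACCESSION-1, so the name is
    # already known when the locus tag arrives (an assumption about the file).
    if (defined $cycLoc14) {
        $Ho14Loc2GeNm{$cycLoc14} = $cycNm14;
        # Clear everything so nothing leaks into the next record.
        ($cycID14, $cycLoc14, $cycNm14) = ();
    }
}
close $in;

print "$_ => $Ho14Loc2GeNm{$_}\n" for sort keys %Ho14Loc2GeNm;
```

This version depends on the order of the fields within a record. If ACCESSION-1 could appear before COMMON-NAME, the assignment would have to wait until the // line that ends the record.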
Re^2: Parsing line by line by AWallBuilder (Beadle) on Jan 14, 2015 at 14:42 UTC
Thanks, #2 was what I was looking for. I had thought about this conceptually but didn't know how to do it. I had realized #1 earlier. Thanks.
Re: Parsing line by line by hdb (Monsignor) on Jan 15, 2015 at 10:49 UTC
You could read all of the data into an array of hashes (hash references). A new hash reference is added every time you encounter a UNIQUE-ID.

use strict;
use warnings;
use Data::Dumper;

my @parsed;
while (<DATA>) {
    next unless /^(.*) - (.*)$/;
    push @parsed, { } if $1 eq "UNIQUE-ID";
    $parsed[-1]{ $1 } = $2;
}
print Dumper \@parsed;

__DATA__
#
UNIQUE-ID - GJDZ-5046
TYPES - BC-4
TYPES - Unclassified-Genes
COMMON-NAME - STM14_5042
ACCESSION-1 - STM14_5042
CENTISOME-POSITION - 90.96536
COMPONENT-OF - CHROMOSOME-1-100
COMPONENT-OF - TUJDZ-2494
COMPONENT-OF - CHROMOSOME-1
LEFT-END-POSITION - 4430254
PRODUCT - GJDZ-5046-MONOMER
RIGHT-END-POSITION - 4430427
TRANSCRIPTION-DIRECTION - -
//
UNIQUE-ID - GJDZ-1101
TYPES - BC-4
TYPES - Unclassified-Genes
COMMON-NAME - focA
ACCESSION-1 - STM14_1100
CENTISOME-POSITION - 20.85712
COMPONENT-OF - CHROMOSOME-1-23
COMPONENT-OF - TUJDZ-587
COMPONENT-OF - CHROMOSOME-1
LEFT-END-POSITION - 1015797
PRODUCT - GJDZ-1101-MONOMER
RIGHT-END-POSITION - 1016774
TRANSCRIPTION-DIRECTION - -
//

Keys that repeat within a record (such as TYPES or COMPONENT-OF) would be overwritten though.
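If the repeated keys matter, one possible variation is to store every field as an array reference, so that repeated keys accumulate instead of overwriting each other. The sketch below is an illustration under assumptions: it reads from an in-memory string ($sample, a single shortened record) rather than from DATA, and it splits each line at the first " - ".

```perl
use strict;
use warnings;

# One shortened record from the sample input (assumption: same layout as the real file).
my $sample = <<'END';
#
UNIQUE-ID - GJDZ-1101
TYPES - BC-4
TYPES - Unclassified-Genes
COMMON-NAME - focA
ACCESSION-1 - STM14_1100
COMPONENT-OF - CHROMOSOME-1-23
COMPONENT-OF - TUJDZ-587
TRANSCRIPTION-DIRECTION - -
//
END

# In-memory filehandle so the sketch runs without an input file.
open my $in, '<', \$sample or die "Cannot open in-memory sample: $!";

my @parsed;
while (my $line = <$in>) {
    chomp $line;
    # Non-greedy key, so the line is split at the first " - ".
    next unless $line =~ /^(.*?) - (.*)$/;
    my ($key, $value) = ($1, $2);

    # Each UNIQUE-ID starts a new record.
    push @parsed, {} if $key eq 'UNIQUE-ID';
    next unless @parsed;    # ignore fields before the first UNIQUE-ID

    # Every field is an array reference, so repeated keys accumulate.
    push @{ $parsed[-1]{$key} }, $value;
}
close $in;

print join(', ', @{ $parsed[0]{'TYPES'} }), "\n";
```

The cost of this layout is that single-valued fields also need an index, for example $parsed[0]{'COMMON-NAME'}[0].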