PerlMonks  

Parsing line by line

by AWallBuilder (Beadle)
on Jan 14, 2015 at 13:40 UTC

AWallBuilder has asked for the wisdom of the Perl Monks concerning the following question:

Dear all, I am familiar with how to read in a table, parse the columns, and create hashes holding the data. However, I am now dealing with input files where related variables are spread over consecutive rows rather than sitting in columns of a single row, and my parsing logic no longer works. The error might have to do with where I initialize my variables. Any help is appreciated.

sample input data

#
UNIQUE-ID - GJDZ-5046
TYPES - BC-4
TYPES - Unclassified-Genes
COMMON-NAME - STM14_5042
ACCESSION-1 - STM14_5042
CENTISOME-POSITION - 90.96536
COMPONENT-OF - CHROMOSOME-1-100
COMPONENT-OF - TUJDZ-2494
COMPONENT-OF - CHROMOSOME-1
LEFT-END-POSITION - 4430254
PRODUCT - GJDZ-5046-MONOMER
RIGHT-END-POSITION - 4430427
TRANSCRIPTION-DIRECTION - -
//
UNIQUE-ID - GJDZ-1101
TYPES - BC-4
TYPES - Unclassified-Genes
COMMON-NAME - focA
ACCESSION-1 - STM14_1100
CENTISOME-POSITION - 20.85712
COMPONENT-OF - CHROMOSOME-1-23
COMPONENT-OF - TUJDZ-587
COMPONENT-OF - CHROMOSOME-1
LEFT-END-POSITION - 1015797
PRODUCT - GJDZ-1101-MONOMER
RIGHT-END-POSITION - 1016774
TRANSCRIPTION-DIRECTION - -
//

snippet of code that gives wrong results

my %Ho14Loc2GeNm;
while (my $lines = <IN>) {
    my $cycID14; my $cycLoc14; my $cycNm14;
    next unless (($lines =~ /^UNIQUE-ID/) || ($lines =~ /^ACCESSION-1/) || ($lines =~ /^COMMON-NAME/));
    chomp $lines;
    if ($lines =~ /^UNIQUE-ID/) {
        $lines =~ /(GJDZ-[0-9]+)/;
        $cycID14 = $1;
    }
    if ($lines =~ /^COMMON-NAME/) {
        $lines =~ /COMMON-NAME - (\S+)/;
        $cycNm14 = $1;
    }
    if ($lines =~ /^ACCESSION-1/) {
        $lines =~ /(STM14_[0-9]+)/;
        $cycLoc14 = $1;
    }
    if (defined($cycLoc14)) {
        $Ho14Loc2GeNm{$cycLoc14} = $cycNm14;
    }
}
print Dumper(%Ho14Loc2GeNm);
close(IN);

Replies are listed 'Best First'.
Re: Parsing line by line
by Eily (Monsignor) on Jan 14, 2015 at 14:02 UTC

    Since your records are separated by //, you can set $/ (the input record separator) to that string, so that each read returns a whole record instead of a single line.

    And with the /m modifier you can use ^ to mean the beginning of a line anywhere in your string.

    Your input file seems to have a well-defined format, "IDENTIFIER - value", so I think the easiest way to read it is to get all values into a hash and then fetch the ones you need:

    {
        local $/ = '//';
        while (my $record = <IN>) {
            # Field names may contain digits (e.g. ACCESSION-1)
            my %values = $record =~ /^([-A-Z0-9]+)\s+-\s+(.*)/mg;
            next unless exists $values{'UNIQUE-ID'} and exists $values{'ACCESSION-1'};
            # Your code using $values{'UNIQUE-ID'} and other values here
        }
    }
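
    A minimal sketch of this approach applied to the question's data, assuming the goal is a hash from the ACCESSION-1 value to the COMMON-NAME value. The names $text and %locus2name are hypothetical, the input is an abbreviated copy of the sample records read through an in-memory filehandle, and the separator is assumed to be "//" alone on its own line:

```perl
use strict;
use warnings;

# Abbreviated copy of the two sample records (assumption: one field per line).
my $text = <<'END';
#
UNIQUE-ID - GJDZ-5046
TYPES - BC-4
COMMON-NAME - STM14_5042
ACCESSION-1 - STM14_5042
TRANSCRIPTION-DIRECTION - -
//
UNIQUE-ID - GJDZ-1101
TYPES - BC-4
COMMON-NAME - focA
ACCESSION-1 - STM14_1100
TRANSCRIPTION-DIRECTION - -
//
END

my %locus2name;    # hypothetical name: ACCESSION-1 => COMMON-NAME

open my $in, '<', \$text or die "Cannot open in-memory file: $!";
{
    local $/ = "//\n";    # read one whole record per iteration
    while (my $record = <$in>) {
        # /m makes ^ and $ match at every line boundary inside the record;
        # /g in list context returns all (key, value) pairs at once.
        my %values = $record =~ /^([-A-Z0-9]+) - (.*)$/mg;
        next unless exists $values{'ACCESSION-1'}
                and exists $values{'COMMON-NAME'};
        $locus2name{ $values{'ACCESSION-1'} } = $values{'COMMON-NAME'};
    }
}
close $in;

print "$_ => $locus2name{$_}\n" for sort keys %locus2name;
```

    Because $/ is set with local inside a bare block, the default line-by-line behaviour is restored once the block ends.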

      thanks, I need to look into this input record separator.
Re: Parsing line by line
by choroba (Cardinal) on Jan 14, 2015 at 13:57 UTC
    Two issues:
    1. You can't declare the lexical variables inside the loop. It creates new variables for each iteration, i.e. for each input line, so the values are not preserved. Move the declaration before the while.
    2. After adding a new record to the hash, clear the variables. It's enough to add the following line after the assignment:
      undef $cycLoc14;
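
    The two points above can be sketched as follows. This keeps the original variable names but reads an abbreviated copy of the sample data from an in-memory filehandle ($text and $in are hypothetical), and it assumes COMMON-NAME appears before ACCESSION-1 in each record, as it does in the sample. Clearing $cycNm14 as well is an extra precaution, not something the reply asks for:

```perl
use strict;
use warnings;

# Abbreviated copy of the two sample records.
my $text = <<'END';
#
UNIQUE-ID - GJDZ-5046
COMMON-NAME - STM14_5042
ACCESSION-1 - STM14_5042
TRANSCRIPTION-DIRECTION - -
//
UNIQUE-ID - GJDZ-1101
COMMON-NAME - focA
ACCESSION-1 - STM14_1100
TRANSCRIPTION-DIRECTION - -
//
END

open my $in, '<', \$text or die "Cannot open in-memory file: $!";

my %Ho14Loc2GeNm;
my ($cycLoc14, $cycNm14);    # issue 1: declared once, before the loop

while (my $line = <$in>) {
    chomp $line;
    if ($line =~ /^COMMON-NAME - (\S+)/) {
        $cycNm14 = $1;       # remembered until the accession line arrives
    }
    elsif ($line =~ /^ACCESSION-1 - (STM14_[0-9]+)/) {
        $cycLoc14 = $1;
    }
    if (defined $cycLoc14) {
        $Ho14Loc2GeNm{$cycLoc14} = $cycNm14;
        undef $cycLoc14;     # issue 2: clear after storing the record
        undef $cycNm14;      # extra precaution so a name cannot leak into the next record
    }
}
close $in;

print "$_ => $Ho14Loc2GeNm{$_}\n" for sort keys %Ho14Loc2GeNm;
```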

    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
      Thanks, #2 was what I was looking for. I had thought about this conceptually but didn't know how to do it. I had already realized #1 earlier. Thanks.
Re: Parsing line by line
by hdb (Monsignor) on Jan 15, 2015 at 10:49 UTC

    You could read in all of the data into an array of hash(references). A new hash(ref) would be added every time you encounter a UNIQUE-ID.

    use strict;
    use warnings;
    use Data::Dumper;

    my @parsed;
    while (<DATA>) {
        next unless /^(.*) - (.*)$/;
        push @parsed, { } if $1 eq "UNIQUE-ID";
        $parsed[-1]{ $1 } = $2;
    }
    print Dumper \@parsed;

    __DATA__
    #
    UNIQUE-ID - GJDZ-5046
    TYPES - BC-4
    TYPES - Unclassified-Genes
    COMMON-NAME - STM14_5042
    ACCESSION-1 - STM14_5042
    CENTISOME-POSITION - 90.96536
    COMPONENT-OF - CHROMOSOME-1-100
    COMPONENT-OF - TUJDZ-2494
    COMPONENT-OF - CHROMOSOME-1
    LEFT-END-POSITION - 4430254
    PRODUCT - GJDZ-5046-MONOMER
    RIGHT-END-POSITION - 4430427
    TRANSCRIPTION-DIRECTION - -
    //
    UNIQUE-ID - GJDZ-1101
    TYPES - BC-4
    TYPES - Unclassified-Genes
    COMMON-NAME - focA
    ACCESSION-1 - STM14_1100
    CENTISOME-POSITION - 20.85712
    COMPONENT-OF - CHROMOSOME-1-23
    COMPONENT-OF - TUJDZ-587
    COMPONENT-OF - CHROMOSOME-1
    LEFT-END-POSITION - 1015797
    PRODUCT - GJDZ-1101-MONOMER
    RIGHT-END-POSITION - 1016774
    TRANSCRIPTION-DIRECTION - -
    //

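    Duplicate entries would be overwritten though: for keys that repeat within a record (TYPES and COMPONENT-OF in the sample), only the last value is kept.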
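
    If the repeated keys matter, one possible variation (a sketch, not part of the reply above) is to store every value in an array reference so that nothing is overwritten. The in-memory filehandle and the name $text are hypothetical, the data is an abbreviated copy of the sample, and the pattern assumes field names contain no whitespace:

```perl
use strict;
use warnings;

# Abbreviated copy of the sample records, including repeated keys.
my $text = <<'END';
#
UNIQUE-ID - GJDZ-5046
TYPES - BC-4
TYPES - Unclassified-Genes
COMMON-NAME - STM14_5042
COMPONENT-OF - CHROMOSOME-1-100
COMPONENT-OF - TUJDZ-2494
COMPONENT-OF - CHROMOSOME-1
TRANSCRIPTION-DIRECTION - -
//
UNIQUE-ID - GJDZ-1101
TYPES - BC-4
TYPES - Unclassified-Genes
COMMON-NAME - focA
//
END

open my $in, '<', \$text or die "Cannot open in-memory file: $!";

my @parsed;    # one hash reference per record; each value is an array reference
while (my $line = <$in>) {
    next unless $line =~ /^(\S+) - (.*)$/;
    my ($key, $value) = ($1, $2);
    push @parsed, {} if $key eq 'UNIQUE-ID';    # a new record starts here
    next unless @parsed;                        # skip fields before the first UNIQUE-ID
    push @{ $parsed[-1]{$key} }, $value;        # append instead of overwrite
}
close $in;

for my $record (@parsed) {
    print "$record->{'UNIQUE-ID'}[0]: TYPES = @{ $record->{'TYPES'} }\n";
}
```

    The cost of this layout is that single-valued fields also become one-element arrays, so they are read as $record->{'COMMON-NAME'}[0].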
