Parsing a multiline data structure

HamNRye has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I work for a newspaper and we are trying to generate a page report from our classified system. We plan to do this by parsing the "PIL" file that calls all of the elements for pagination.

I need to get all of the "object" sections out of a file, and they are multi-line sections.

The file starts with a line like so:

pageoutput-5.0.1.0:20001201,TD,*,*,,FOO,2; {
[download]

Then there is a section that defines the path to the EPS's that make up the section:

name-table { 
"DAILY.evo" atex-itf unix-filename "/atexdata/clspag-dataD5/tmp/cpgfol
+2_018356_00" 
"1400.cm1" atex-itf unix-filename "/atexdata/clspag-dataD5/furn-conten
+t/header/bin/T_1400.cm1" 
"1476 .wr1" atex-itf unix-filename "/atexdata/clspag-dataD5/furn-conte
+nt/header/bin/T_1476.wr1" 
"608373" eps unix-filename "/atexdata/clspag-dataD5/ad-content/71/1432
+671" 
.........(etc.)
[download]

Next section sets up the page dims:

canvas { 
      user-name "ATEX-CLSPAG:20001201,TD,*,*,,F,2,WHEELS,2;"     
      units 35278 
      dimensions { 9360 15241 }
      application { 
                   app-name "ATEX-REFLEX" 
      string                  { last-modified-time "20001203:14:24:46"
+ }
      string                  { status "UNDEFINED" }
      
         }
      }
[download]

And the we have "object" sections for each element:

object { 
     origin { 0 0 }
     dimensions { 9360 130 }
     id "DAILY.evo" 
     user-name "DAILY.evo" 
     rc-name "DAILY.evo" 
     application { 
        app-name "ATEX-FPO-DUMMY-CONTROL" 
     string                  { major-type "FRN" }
     string                  { ad-number "DAILY.evo" }
         }
     }
object { 
     origin { 0 130 }
     dimensions { 900 1 }
     id "1400.cm1" 
     user-name "1400.cm1" 
     rc-name "1400.cm1" 
     application { 
       app-name "ATEX-FPO-DUMMY-CONTROL" 
     string                  { major-type "FRN" }
     string                  { ad-number "1400.cm1"}
        }
     }
[download]

And these repeat for each ad on the page. How would I get each of these object sections?? I would like to store them in a hash called %object and key them by the "id" field in the object section. I also will need to store the object origin and dimension.

example:
%object{1400.cm1;originw} ; # References Origin Width for 1400.cm1
[download]

I know this must seem simple to many of you, but I don't really understand stepping through a file and getting this all out. Thanks for any help. (Didn't see this in TFM, but i might just not know where to look...)

~HamNRye

Comment on Parsing a multiline data structure Select or Download Code

Replies are listed 'Best First'.
Re: Parsing a multiline data structure by Fastolfe (Vicar) on Dec 09, 2000 at 22:30 UTC
If you can describe a yacc-like grammer for this, you may be interested in Parse::RecDescent, which would pull all of this into Perl for you. I don't have much experience with it, but for complex or nested data, it's usually going to be easier to use than, say, Perl regular expressions at getting at this data.	[reply]
Re: Re: Parsing a multiline data structure by HamNRye (Monk) on Dec 10, 2000 at 02:31 UTC
Goes to show that if you plug away the answer will come.... while (<PIL>) { # print ; if (m/\ object\ \{/) { @lines = $_ ; $objcount++ ; } elsif (m/^\ \}/) { &Parse_Obj(@lines) ; # print "Found Match!\n" ; } else { push @lines, $_ ; } } sub Parse_Obj { $id = @_[3] ; $origin = @_[1] ; $dims = @_[2] ; $id =~ s/^.?\"//g ; #remove up to first " $id =~ s/\".\n$//g ; #remove everything from last " $origin =~ s/^.?\{\ //g ; #remove up to first { $origin =~ s/\ \}.\n$//g ; #remove from last } $dims =~ s/^.?\{\ //g ; #remove up to first { $dims =~ s/\ \}.\n$//g ; #remove from last } ( $originw, $originh ) = split / /, $origin ; ( $dimsw, $dimsh ) = split / /, $dims ; # print "$objcount : $id $originw $originh $dimsw $dimsh \n" ; if (exists($object{$objcount})) { break } else { $object{"$objcount$;id"} = $id ; $object{"$objcount$;originw"} = $originw ; $object{"$objcount$;originh"} = $originh ; $object{"$objcount$;dimsw"} = $dimsw ; $object{"$objcount$;dimsh"} = $dimsh ; } } # print "join(" ", %object) \n" ; for ($count < $objcount) { print "$object{$count,id} $object{$count,originw} $object{$count,origi +nh} \n" ; } [download] Thanks for everybody's help. ~Das Ham Man	[reply] [d/l]
Re (tilly) 1: Parsing a multiline data structure by tilly (Archbishop) on Dec 10, 2000 at 02:35 UTC
I will have to second Fastolfe's recommendation. That really is the best long-term solution. But short-term you can roll something quick and dirty. Now your basic problem is that you need to recognize seeing "object" and then extract nested parens. But unfortunately for you you cannot possibly handle arbitrary nesting with an RE. However the following code works for me with the example data you gave: use strict; use Carp; my @objects = get_objects("data"); print join "\n\n", @objects; sub get_block { my $a_ref = shift; my $open = 0; my @read; while (my $line = shift @$a_ref) { push @read, $line; $open += () = ($line =~ /\{/g); $open -= () = ($line =~ /\}/g); if (0 == $open) { return join '', @read; } elsif (0 > $open) { my $not_block = join '', @read; warn("Too many closing parentheses in:\n$not_block\n\n"); unshift @$a_ref, @read; return; } } confess "Unclosed brace at end of file\n"; } # Takes a filename, returns the contents. sub get_file { my $file = shift; local IN; open (IN, "< $file") or confess "Cannot read '$file': $!"; if (wantarray) { return <IN>; } else { return join '', <IN>; } } # Takes the name of a pil-file and returns an array of object blocks. # Not particularly robust. sub get_objects { my @objs; my @lines = get_file(shift); while (my $line = shift @lines) { if ($line =~ /object\s\{/) { # Found a block? unshift @lines, $line; push @objs, get_block(\@lines); } } return @objs; } [download] Note that this is rather fragile though. (For instance I assume that parens don't appear in quoted fields, I assume that blocks start and end on lines without other things in them, I assume that there are not things looking like the start of a block elsewhere in the file...) And note that once you have your array of objects you will want to try to try to match specific patterns to extract data you want to extract out of the objects. (Data like the id, origin, etc.) If you have the energy to learn Parse::RecDescent and figure out a real grammar, it will be much much better than a hack like the above...	[reply] [d/l]
Re: Parsing a multiline data structure by Albannach (Monsignor) on Dec 10, 2000 at 03:51 UTC
I've found the other solutions quite illuminating, but not wanting to be left out (and because I already did it ;-), here's my kick at it. The patterns could use some loosening on whitespace, I'm hoping no objects have missing attributes (you'd get the previous ones), and you can't have an object nested inside another with this, but it appears sufficient for the sample data. Hmm... I'm also ignoring the extraction of entire objects that tilly handles so well. Now I'll go read up on Parse::RecDecent - so many monks can't be wrong! use strict; my %object; open IN, '<'.pop or die "can't open data: $!\n"; until(<IN> =~ /^object\s\{/) {}; #skip ahead to the body of the fir +st object{} my($id, $origin, $dims); while(<IN>) { SWITCH: { /id "(.+)"/ && do { $id = $1; last SWITCH; }; /origin \{ (\d+) (\d+) \}/ && do { $origin = "$1,$2"; last SWITCH; }; /dimensions \{ (\d+) (\d+) \}/ && do { $dims = "$1,$2"; last SWITCH; }; (/^object\s\{/ \|\| eof(IN)) && do { $object{$id}{'origin'} = + $origin; $object{$id}{'dims'} = + $dims; }; # default: on to the next line }#SWITCH } foreach my $id (keys %object) { print "$id: "; foreach my $attr (keys %{$object{$id}} ) { print "$attr = $object{$id}{$attr} "; } print "\n"; } [download] produces: `DAILY.evo: origin = 0,0 dims = 9360,130 1400.cm1: origin = 0,130 dims = 900,1` [download] -- I'd like to be able to assign to an luser	[reply] [d/l] [select]
Re: Parsing a multiline data structure by eg (Friar) on Dec 10, 2000 at 14:09 UTC
In case anyone wants to know what it would look like, a Parse::RecDescent parser would be something like this: #!/usr/bin/perl -w use strict; use Parse::RecDescent; sub Parse::RecDescent::dispatch { my $self = shift(); my ($key, $value) = @_; print "$key: "; if ( ref($value) eq 'ARRAY' ) { print join(", ", @$value); } else { print $value; } print "\n"; } my $parser = new Parse::RecDescent << '__GRAMMAR__'; Pil: Title '{' NameTable Canvas Object(s) '}' /\Z/ Title: /^pageoutput-[^;]+;/ # the name-table NameTable: 'name-table' '{' NTLine(s) '}' NTLine: FileName FileType 'unix-filename' FileName FileName: '"' /[^"]+/ '"' FileType: 'atex-itf' \| 'eps' # the canvas Canvas: 'canvas' '{' Data '}' { print "============================= CANVAS\n"; } # the objects Object: 'object' '{' Data '}' { print "============================= OBJECT\n"; } Data: KeyValue(s) KeyValue: Key Value { $thisparser->dispatch( $item[1], $item[2] ); } Key: /^\S+/ Value: '"' /[^"]+/ '"' { $item[2]; } \| Number \| '{' Array '}' { $item[2]; } Array: Number(s) \| KeyValue(s) Number: /\d+/ __GRAMMAR__ undef $/; $parser->Pil(<STDIN>); [download] This is incomplete. What is left to do is to associate the various key/value pairs with their respective owners (either a 'canvas' or 'object'). If you try it out, note that an object's key/value pairs are printed before that object.	[reply] [d/l]