http://qs1969.pair.com?node_id=473596

bluetrust has asked for the wisdom of the Perl Monks concerning the following question:

Ahoy Perl Pirates,

I'm having trouble figuring out how to convert a macro language into a data structure for processing.

The macro script looks like this:
page p1 { question 4B { label { Do you like your pie with ice cream? } single { 1 Yes 2 No } } question 4C { label { Do you like your pie with whipped cream? } single { 1 Yes 2 No } } }
(It's for writing questionnaires.) There's an awful lot of nesting going on.

I was thinking that the exercise would be simpler if we processed the script by searching for the macro grammar patterns:
1. Block Type -- (word) (optional:word) { (nested patterns) } 2. String Type -- (string)
The in-memory data structure for processing it is all up to me, but I was thinking that it would probably look something like this in the end:
my $nestedhash = { type => page, name => p1, contains => [ <-one or more $nestedhash structures-> <-or, one or more simple scalars (for the strings)-> ], };
How on earth would you go about converting one to another?

I've looked into Parse::RecDescent and it seems to be ideal, but it's a complicated module and none of the tutorials I've looked at have an example dealing with nested grammar. If I could get it to work though, it looks like it would be easier to extend than the other solutions I had in mind involving a loop and either a regular expression for finding childless blocks -- or a floating reference that moves up and down the data structure as we process each line. (I've done those sorts of things before, and the code was always unreadable afterwards.)

What do you think, mateys?

Is there a simple solution to this?

Replies are listed 'Best First'.
Re: Parsing a macro language
by kvale (Monsignor) on Jul 08, 2005 at 22:57 UTC
    However you decide to translate the format, you will want to have a good description of the format, which is provided by a grammar. Once you have a grammar, converting into a format suitable for P::RD should be fairly straightforward.

    From your example format, a coarse version of the grammar would be something like

    <page> := /page p\d+ \{/ <question>* /\}/ <question> := /question \w+ {/ <label> <choices> /\}/ <label> := /label \{ [^}]+ \}/ <choices> := /single {/ <choice>+ /\}/ <choice> := /\d+ (Yes|No)/
    where I have not included whitespace elements in the regexes.

    Here the nesting, or hierarchy is represented by different grammar elements contained inside others.

    -Mark

Re: Parsing a macro language
by cmeyer (Pilgrim) on Jul 08, 2005 at 23:58 UTC

    Consider using an existing data serialization language.

    I enjoy using YAML for configuration information or other external data. It's advantages are that it is easy to read and write with any text editor, the syntax is fairly intuitive (it looks a lot like outlines done in email), and there are decent parsers and emitters for many common languages, including Perl (YAML).

    One of the very nice things about YAML is that it is a direct representation of scalar, array and hash data structures, nested arbitrarily. This means that the language is rich enough to represent any data structure that you can in Perl (or Python, Ruby, JavaScript, PHP, etc.). It also means that you don't have to perform interesting data contortions (or use objects, tree structures, etc.), as you might with XML. Rather, you use the structure directly. It's an AoHoAoH, or what ever you like.

    Your data in YAML might look like:
    #!/usr/local/bin/perl use YAML; use Data::Dumper; my $data = join '', <DATA>; my $pages = Load( $data ); # YAML does the syntax checking and parsing. # It is still up to you to perform data validation. print Dumper( $pages ); __DATA__ # this is your data, represented in YAML - page: p1 questions: - name: 4B label: Do you like your pie with ice cream? single: - Yes - No - name: 4C label: Do you like your pie with whipped cream? single: - Yes - No # and so on ... # - page: p2 # questions: # -
    The Perl data structure looks like this:
    $pages = [ { 'page' => 'p1', 'questions' => [ { 'name' => '4B', 'label' => 'Do you like your pie with i +ce cream?', 'single' => [ 'Yes', 'No' ], }, { 'name' => '4C', 'label' => 'Do you like your pie with w +hipped cream?', 'single' => [ 'Yes', 'No' ], } ] } ];

    -Colin.

    WHITEPAGES.COM | INC

Re: Parsing a macro language
by nothingmuch (Priest) on Jul 08, 2005 at 23:23 UTC
    If you want to avoid PRD, there's a few things you can do:
    • Write an event parser for your language:
      • pass events to an event handler object
      • the two tokens 'page foo' generate an event new_type("page","foo")which creates a new elem as a child of the element at the top of a stack
      • The token { puts the last child of the top of the stack on the top of the stack
      • the } token pops an element from the stack
      • anything that is not recognized as a "new thing" structure ((\w+)\s+(?:(\w+)\s+)?\{) is globbed up, and passed to the 'character_data' event, in your case, probably one per line
      • the event handler has a 'root' element predefined, at the top of the stack
    • use something like the event parser to convert the language with no state into XML or YAML or whatever, and use a parser for that
    • use ??{ } in regexes in a similar manner to the event parser handler. If you're going that way, you can nest expressions using ??{ }. See perlre for some devious tricks you can do with this construct. /msg me if you would like me to post an example.
    Update: it's done. it was fun, but don't use it. Someone below implemented the event parser I was talking about, just not in a decoupled OO kind of way.
    use strict; use warnings; use re 'eval'; my $str = <<FOO; page p1 { question 4B { label { Do you like your pie with ice cream? } single { 1 Yes 2 No } } question 4C { label { Do you like your pie with whipped cream? } single { 1 Yes 2 No } } } FOO my $string = qr/ ^ (?> \s* (.+) ) \s* $ (?{ add_string($^N) }) /xm; my $tokens; my ($type, $name); my $block = qr/ # capture a type (?: (\w+) \s+ ) (?{ $type = $^N }) ( # capture an optional name, set $name to that (?{ $name = undef }) # first unset $name, in case this doesn't + match ((?: (\w+) \s+ )(?{ $name = $^N }) )? ) \{ # if this starts to look like an element, push a new cell on th +e stack (?{ new_elem($type, $name) }) ( ( # this subpattern tries to capture a complete body, with t +he closing brace (??{ $tokens }) \} (?{ close_elem() }) # if we got here it means we have a fu +ll body, with tokens and a closing brace ) | ( # if we got here, then the body subpattern failed, and we + must abort (?{ abort_elem() }) (?!) # this match always fails because it negates a match +on anything, that always succeeds ) ) /xs; my $blocks = qr/($block \s*)+/xs; my $strings = qr/($string \s*)+?/xs; $tokens = qr/\s* ( $blocks | $strings ) \s*/xs; # tokens is either som +e strings, or some blocks my $doc = qr/^$tokens$/s; my @stack; new_elem("doc" => "root"); # create the root element $str =~ $doc; use Data::Dumper; warn Dumper(@stack); # should contain just the root element sub new_elem { my $elem = { type => $_[0], (defined($_[1]) ? (name => $_[1]) : ()), contains => [], }; if (@stack){ push @{ $stack[-1]{contains} }, $elem } push @stack, $elem; } sub abort_elem { pop @stack; pop @{ $stack[-1]{contains} }; } sub close_elem { pop @stack } sub add_string { push @{ $stack[-1]{contains} }, $_[0] }
    -nuffin
    zz zZ Z Z #!perl
Re: Parsing a macro language
by graff (Chancellor) on Jul 08, 2005 at 23:06 UTC
    My initial reaction would be that pagination is a secondary concern, and that the list of questions is the root of the matter, so I'd start with an array of hashes -- or, if the index-naming of the questions is "non-linear" or "semantic" in some way (i.e. not just an ordered list, but a set of distinctly named entities), then make it a hash of hashes.

    In any case, the "outermost, primary" unit of organization is the "question", and features of each question are simply:

    • its position in the sequence (or its ID/name, which is presumably sortable in some way)
    • which page it should be on
    • what its label is
    • what its possible answers are

    so the structure could be:

    my @questions = [ { page => 'p1', idstr => '4B', label => "... ice cream?", answertype => 'single', answerset => { 1 => 'yes', 2 => 'no' } } ... ]; # or ... my %questions = { '4B' => { page => 'p1', label => '... ice cream?', ... } '4C' => { page => 'p1', label => '... whipped cream?', ... } ... };
    As for parsing the input text to fill that structure, there are numerous ways, and Parse::RecDescent would certainly do it (but it might be overkill -- other ways would suffice and be easier if you're really green with P::RD).
Re: Parsing a macro language
by simonm (Vicar) on Jul 09, 2005 at 01:09 UTC
    Is there a simple solution to this?

    Yes -- as long as your format doesn't have any more complexity that you haven't told us about yet.

    There's a general approach to tree-building that's applicable here: Keep a stack of your "active" container, and every time you find a line with a "{" on the end of it, push a new container on to the stack. On lines with a "}", pop the active item off the stack. Every other piece of data that we find can get pushed into whichever container is currently active.

    Of course, if your format includes escape characters, multi-line elements, or other complications, you may need to use one of the industrial-strength parser-generators... But for a simple format, we can roll our own.

    Take a look at the output of the below, with and without $fewer_indents set true, and modify as desired.

    sub parse_brackets { my @parse; my @stack = \@parse; my $fewer_indents = 1; # Try setting this to 0 or 1 my $line_no; foreach my $line ( @_ ) { $line_no ++; $line =~ s/\A\s+//; $line =~ s/\s+\Z//; if ( $line !~ /\S/ ) { next; } elsif ( $line =~ s/\s*\{$// ) { my @line = split ' ', $line; push @{ $stack[0] }, \@line; if ( $fewer_indents ) { unshift @stack, \@line; } else { my @kids = (); push @line, \@kids; unshift @stack, \@kids; } } elsif ( $line eq '}' ) { shift @stack; scalar @stack or die("Too many right brackets at line $line_no") +; } else { push @{ $stack[0] }, $line; } } return @parse; } use Data::Dumper; print Dumper( parse_brackets( split "\n", <<'EXAMPLE' ) ); page p1 { question 4B { label { Do you like your pie with ice cream? } single { 1 Yes 2 No } } question 4C { label { Do you like your pie with whipped cream? } single { 1 Yes 2 No } } } EXAMPLE

      Or, for a hash-based represenation that's closer to what you originally asked for:

      sub parse_brackets { my @parse; my @stack = \@parse; my $fewer_indents = 1; # Try setting this to 0 or 1 my $line_no; foreach my $line ( @_ ) { $line_no ++; $line =~ s/\A\s+//; $line =~ s/\s+\Z//; if ( $line !~ /\S/ ) { next; } elsif ( $line =~ s/\s*\{$// ) { my @line = split ' ', $line; my %node = ( type => shift(@line), ( @line ? ( name => shift(@li +ne) ) : () ) ); push @{ $stack[0] }, \%node; unshift @stack, do { $node{contents} = [] }; } elsif ( $line eq '}' ) { shift @stack; scalar @stack or die("Too many right brackets at line $line_no") +; } else { push @{ $stack[0] }, $line; } } scalar @stack == 1 or die("Too few right brackets at line $line_no") +; return @parse; }

      Output:

      $VAR1 = { 'contents' => [ { 'contents' => [ { 'contents' => [ 'Do you +like your pie with ice cream?' ], 'type' => 'label' }, { 'contents' => [ '1 Yes', '2 No' ], 'type' => 'single' } ], 'name' => '4B', 'type' => 'question' }, { 'contents' => [ { 'contents' => [ 'Do you +like your pie with whipped cream?' ], 'type' => 'label' }, { 'contents' => [ '1 Yes', '2 No' ], 'type' => 'single' } ], 'name' => '4C', 'type' => 'question' } ], 'name' => 'p1', 'type' => 'page' };

      Oh, and for future reference, I don't think I would call this a "macro language". It's a data markup format, with no functional or substitutional behavior.

        You could decouple the parsing and building steps, as nothingmuch suggested:
        sub parse_brackets { my @out; foreach my $line ( @_ ) { $line_no ++; $line =~ s/\A\s+//; $line =~ s/\s+\Z//; if ( $line !~ /\S/ ) { next; } elsif ( $line =~ s/\s*\{$// ) { push @out, 'open' => [ split ' ', $line ]; } elsif ( $line eq '}' ) { push @out, 'close' => 1; } else { push @out, 'line' => $line; } } return @out; } sub build_brackets { my @parse; my @stack = \@parse; while ( scalar @_ ) { my ( $type, $value ) = splice( @_, 0, 2 ); if ( $type eq 'open' ) { my %node = ( type => shift(@$value), ( @$value ? ( name => shift +(@$value) ) : () ) ); push @{ $stack[0] }, \%node; unshift @stack, do { $node{contents} = [] }; } elsif ( $type eq 'close' ) { shift @stack; scalar @stack or die("Too many right brackets"); } else { push @{ $stack[0] }, $value; } } scalar @stack == 1 or die("Too few right brackets"); return @parse; } use Data::Dumper; print Dumper( build_brackets( parse_brackets( @lines ) ) );

        I'm not sure this is a big win for your case, but if you need the extra flexiblity...

Re: Parsing a macro language
by TedPride (Priest) on Jul 09, 2005 at 06:44 UTC
    It's much easier to parse a structure if you mark the end types as well as the start. You could do something like the following:
    use strict; use warnings; use Data::Dumper; my $structure = main(join '',<DATA>); print Dumper($structure); sub main { my ($s, $c, %hash) = $_[0]; while ($s =~ /START PAGE(?: (\w+))?\s+(.*?)\s+END PAGE/gs) { $hash{$1 ? $1 : ++$c} = page($2); } return \%hash; } sub page { my ($s, $c, %hash) = $_[0]; while ($s =~ /START QUESTION(?: (\w+))?\s+(.*?)\s+END QUESTION/gs) + { $hash{$1 ? $1 : ++$c} = question($2); } return \%hash; } sub question { my ($s, %hash) = $_[0]; ($hash{'label'}) = $s =~ /LABEL (.*)/; $s =~ /START CHOICES\s+(.*?)\s+END CHOICES/s; for (split / *\n */, $1) { push @{$hash{'choices'}}, [split / /, $_, 2]; } return \%hash; } __DATA__ START PAGE p1 START QUESTION 4B LABEL Do you like your pie with ice cream? START CHOICES 1 Yes 2 No END CHOICES END QUESTION START QUESTION 4C LABEL Do you like your pie with whipped cream? START CHOICES 1 Yes 2 No END CHOICES END QUESTION END PAGE
    I've made the choices into arrays instead of hashes, to preserve order.
Re: Parsing a macro language
by themage (Friar) on Jul 09, 2005 at 14:05 UTC
    Hi,

    I build a small parser to create your perl structure from your file. A part of this is a escape function, that would escape special chars in s///.

    The parsing is done specially by the parse.

    #!/usr/bin/perl -w use strict; use Data::Dumper; sub escape { my $str=shift; $str=~s/\\/\\\\/g; $str=~s/\[/\\[/g; $str=~s/\]/\\]/g; $str=~s/\(/\\(/g; $str=~s/\)/\\)/g; $str=~s/\{/\\{/g; $str=~s/\}/\\}/g; $str=~s/\^/\\^/g; $str=~s/\//\\\//g; $str=~s/\+/\\\+/g; $str=~s/\-/\\\-/g; $str=~s/\$/\\\$/g; $str=~s/\&/\\\&/g; $str=~s/\*/\\\*/g; $str=~s/\|/\\\|/g; $str=~s/\?/\\\?/g; return $str; }; sub parseoptions { my $options=shift; my @opts=(); $options=~s/\s+$//; for my $opt (split /\n/, $options) { $opt=~/^\s*(\w+)\s+(.*)$/; my $data={name=>$1, content=>$2}; unshift @opts, $data; } return \@opts; } sub parse { my $lookat=shift; my @struct=(); while ($$lookat=~/^\s*(\w+)\s+(\w+)?\s*{/) { my $data={type=>$1,name=>$2}; my $str=escape($&); $$lookat=~s/^$str//; if ($$lookat=~/^[^}]+{/) { $data->{content}=parse($lookat); $$lookat=~s/^\s*}//; } else { ($data->{content})=$$lookat=~/^\s*([^}]+)\s*}/ +; my $str=escape($&); $$lookat=~s/^$str//; if ($data->{type} eq 'single') { $data->{content}=parseoptions($data->{ +content}); } } unshift @struct, $data; } return \@struct; } my $quest=""; while (my $line=<>) { $line=~s/^\s+/ /g; $quest.=$line; } my $struct=parse(\$quest); print Dumper($struct);


    Hope this helps.

Re: Parsing a macro language
by hossman (Prior) on Jul 09, 2005 at 05:38 UTC

    Assuming all of your data is as clean looking as this example (ie: everything indented nicely, all questions exactly one line, etc..) then written a sequence of regexes to convert this to YAML -- or even directly to perl code -- should be pretty straight forward.