Grammar translators

polettix has asked for the wisdom of the Perl Monks concerning the following question:

This is halfway between a SoPW and a Meditation, but here it goes...

Starting from Net::SIP, I wanted to develop a better parser based on the grammar in the RFC. Up to now, I only used Parse::RecDescent a couple of times, and I also saw something about Parse::YAPP. The thing is that producing a parser is more or less a two-fold process:

produce a grammar for the protocol in the grammar language of the tool of choice, AND
enrich the grammar with directives/commands/whatever to build up a suitable data structure

The first step should produce something that is capable of "validating" an input without giving something useful. The second step aims to give a data structure that represents the protocol message.

For the first part, many protocols already have a description. In particular, IETF protocols are often described using the ABNF (Augmented Backus-Naur Form), documented in RFC RFC 4234. But this is somehow far from the format of both parsing modules above. My question is: is there something about "automated grammar translation" around? Googling a bit on the matter did not show much, but chances are that I didn't look for the right thing.

I tried to put up a simple translator using Parse::RecDescent itself. In particular, I produced a description of the ABNF grammar in P::RD grammar language:

ALPHA  :  /[A-Za-z]/
BIT    :  /[01]/
CHAR   :  /[\x01-\x7f]/
CR     :  "\x0d"
CRLF   :  "\x0d\x0a"
CTL    :  /[\x00-\x1f\x7f]/
DIGIT  :  /\d/
DQUOTE :  "\x22"
HEXDIG :  /[\dA-F]/
HTAB   :  "\x09"
LF     :  "\x0a"
LWSP   :  /(?:(?:\x0d\x0a)?[\x09\x20])*/
OCTET  :  /[\x00-\xff]/
SP     :  "\x20"
VCHAR  :  /[\x21-\x7e]/
WSP    :  /[\x09\x20]/

_rulelist_alt_grp  : _c_wsps_opt c_nl { join "", $item[1], $item[2] }
_rulelist_alt      :  rule | _rulelist_alt_grp
rulelist           :  _rulelist_alt(s) { join "", @{$item[1]} }

rule               :  rulename defined_as elements c_nl { join "", @it
+em[1..4] }

rulename_opt       :  ALPHA | DIGIT | "-" { "_" }
rulename           :  ALPHA rulename_opt(s?) { join "", $item[1], @{$i
+tem[2]}}

defined_as         :  _c_wsps_opt ("=" | "=/") _c_wsps_opt 
   { join "", $item[1], ':', $item[3] }

elements           :  alternation _c_wsps_opt { join "", $item[1], $it
+em[2] }

c_wsp              :  WSP | (c_nl WSP { join "", @item[1..2]})
_c_wsps_opt        :  c_wsp(s?) { join "", @{$item[1]} }
_c_wsps_plus       :  c_wsp(s) { join "", @{$item[1]} }

c_nl               :  comment | CRLF

_WSP_or_VCHAR      :  WSP | VCHAR
comment            :  ";" _WSP_or_VCHAR(s?) CRLF {
   join "", "#", @{$item[2]}, $item[3];
}

_alternation_opt   :  _c_wsps_opt "/" _c_wsps_opt concatenation
   { join "", $item[1], '|', @item[3..4] }
alternation        :  concatenation _alternation_opt(s?)
   { join "", $item[1], @{$item[2]} }

_concatenation_opt :  _c_wsps_plus repetition { join "", @item[1..2]}
concatenation      :  repetition _concatenation_opt(s?)
   {join "", $item[1], @{$item[2]}}

repetition         :  repeat(?) element
   { 
      if (@{$item[1]}) {
         $return = join "", $item[2], $item[1][0] 
      }
      else {
         $return = $item[2];
      }
   }

_DIGITS_opt        :  DIGIT(s?) { join "", @{$item[1]} }
_repeat_full       :  _DIGITS_opt "*" _DIGITS_opt
   {
      my $min = length($item[1]) ? $item[1] : 0;
      my $max = length($item[3]) ? $item[3] : 'inf';
      $return = undef;
      if ($min == 0) {
         if ($max eq 'inf') { $return = "(s?)" }
         elsif ($max == 1) { $return = "(?)" }
         else { $return = "(0..$max)" }
      }
      elsif ($min == 1) {
         if ($max eq 'inf') { $return = "(s)" }
         elsif ($max == 1) { $return = "" }
         else { $return = "(..$max)" }
      }
      else {
         if ($max eq 'inf') { $return = "($min..)" }
         else { $return = "($min..$max)" }
      }
   }
_repeat_single     :  _DIGITS { "($item[1])" }
repeat             :  _repeat_full | _repeat_single

element            :  rulename | group | option 
                      | char_val | num_val | prose_val

group              :  "(" _c_wsps_opt alternation _c_wsps_opt ")"
   { join "", @item[1..5] }

option             :  "[" _c_wsps_opt alternation _c_wsps_opt "]"
   { join "", "(", @item[2..4], ")(?)" }

char_val           :  DQUOTE /[\x20\x21\x23-\x7e]*/ DQUOTE
   { '"' . quotemeta($item[2]) . '"' }

num_val            :  "%" (bin_val | dec_val | hex_val)

_BITS              :  BIT(s) {
   my $b = join "", @{ $item[1] };
   $b = ('0' x (8 - length($b))) . $b;
   $return = unpack 'H*', pack 'B*', $b;
}
_bin_val_concats   :  _BITS ("." _BITS)(s?) {
   join "", '"', map({"\\x$_"} $item[1], @{$item[2]}), '"';
}
_bin_val_range     : _BITS "-" _BITS {
   "/[\\x$item[1]-\\x$item[3]]/"
}
bin_val            :  "b" (_bin_val_range | _bin_val_concats)

_DIGITS            :  DIGIT(s) { join "", @{$item[1]} }
_DIGITS_x          :  DIGIT(s) { 
   my $d = join "", @{$item[1]};
   $return = unpack 'H*', pack 'C', $d;
}
_dec_val_concats   :  _DIGITS_x ("." _DIGITS_x)(s?) {
   join "", '"', map({ "\\" . "x$_" } $item[1], @{$item[2]}), '"';
}
_dec_val_range     :  _DIGITS_x "-" _DIGITS_x {
   "/[\\x$item[1]-\\x$item[3]]/"
}
dec_val            :  "d" (_dec_val_range | _dec_val_concats )

_HEXDIGS           :  HEXDIG(s) { join "", @{ $item[1] }}
_hex_val_concats   :  _HEXDIGS ("." _HEXDIGS)(s?) {
   join "", '"', map({ "\\x$_" } $item[1], @{$item[2]}), '"';
}
_hex_val_range     : _HEXDIGS "-" _HEXDIGS {
   "/[\\x$item[1]-\\x$item[3]]/"
}
hex_val            :  "x" (_hex_val_range | _hex_val_concats )

prose_val          :  "<" /[\x20-\x3d\x3f-\x73]*/ ">"
[download]

and, when saved as 'abnf-direct.prd', it seems to do its work together with the companion script:

#!/usr/bin/env perl
use strict;
use warnings;
use Parse::RecDescent;
use File::Slurp qw( slurp );

$Parse::RecDescent::skip = '';
my $abnf = slurp 'abnf-direct.prd';
my $parser = Parse::RecDescent->new($abnf);

(my $grammar = slurp(@ARGV ? shift : \*STDIN)) =~ s/(?<!\r)\n/\r\n/g;
my $parsed = $parser->rulelist(\$grammar);
print {*STDERR} "parse ok\n" if defined $parsed;
print {*STDERR} "parsed it all\n" unless length $grammar;

print {*STDOUT} $parsed;
[download]

The script takes a ABNF grammar description in input and produces a grammar that should be suitable for Parse::RecDescent. At this point, one should work on this generated grammar to add methods that build up the data structure.

Is this a total waste of time? Is there something that I should read about before going this way?

Update: added quotemeta in grammar.

Update: just read that someone someone had a similar problem in the past, and that merlyn had written about it.

Flavio
perl -ple'$_=reverse' <<<ti.xittelop@oivalf

Io ho capito... ma tu che hai detto?

Comment on Grammar translators Select or Download Code


XP is just a number
	PerlMonks