The first step should produce something that is capable of "validating" an input without giving something useful. The second step aims to give a data structure that represents the protocol message.
For the first part, many protocols already have a description. In particular, IETF protocols are often described using the ABNF (Augmented Backus-Naur Form), documented in RFC RFC 4234. But this is somehow far from the format of both parsing modules above. My question is: is there something about "automated grammar translation" around? Googling a bit on the matter did not show much, but chances are that I didn't look for the right thing.
I tried to put up a simple translator using Parse::RecDescent itself. In particular, I produced a description of the ABNF grammar in P::RD grammar language:
ALPHA : /[A-Za-z]/
BIT : /[01]/
CHAR : /[\x01-\x7f]/
CR : "\x0d"
CRLF : "\x0d\x0a"
CTL : /[\x00-\x1f\x7f]/
DIGIT : /\d/
DQUOTE : "\x22"
HEXDIG : /[\dA-F]/
HTAB : "\x09"
LF : "\x0a"
LWSP : /(?:(?:\x0d\x0a)?[\x09\x20])*/
OCTET : /[\x00-\xff]/
SP : "\x20"
VCHAR : /[\x21-\x7e]/
WSP : /[\x09\x20]/
_rulelist_alt_grp : _c_wsps_opt c_nl { join "", $item[1], $item[2] }
_rulelist_alt : rule | _rulelist_alt_grp
rulelist : _rulelist_alt(s) { join "", @{$item[1]} }
rule : rulename defined_as elements c_nl { join "", @it
+em[1..4] }
rulename_opt : ALPHA | DIGIT | "-" { "_" }
rulename : ALPHA rulename_opt(s?) { join "", $item[1], @{$i
+tem[2]}}
defined_as : _c_wsps_opt ("=" | "=/") _c_wsps_opt
{ join "", $item[1], ':', $item[3] }
elements : alternation _c_wsps_opt { join "", $item[1], $it
+em[2] }
c_wsp : WSP | (c_nl WSP { join "", @item[1..2]})
_c_wsps_opt : c_wsp(s?) { join "", @{$item[1]} }
_c_wsps_plus : c_wsp(s) { join "", @{$item[1]} }
c_nl : comment | CRLF
_WSP_or_VCHAR : WSP | VCHAR
comment : ";" _WSP_or_VCHAR(s?) CRLF {
join "", "#", @{$item[2]}, $item[3];
}
_alternation_opt : _c_wsps_opt "/" _c_wsps_opt concatenation
{ join "", $item[1], '|', @item[3..4] }
alternation : concatenation _alternation_opt(s?)
{ join "", $item[1], @{$item[2]} }
_concatenation_opt : _c_wsps_plus repetition { join "", @item[1..2]}
concatenation : repetition _concatenation_opt(s?)
{join "", $item[1], @{$item[2]}}
repetition : repeat(?) element
{
if (@{$item[1]}) {
$return = join "", $item[2], $item[1][0]
}
else {
$return = $item[2];
}
}
_DIGITS_opt : DIGIT(s?) { join "", @{$item[1]} }
_repeat_full : _DIGITS_opt "*" _DIGITS_opt
{
my $min = length($item[1]) ? $item[1] : 0;
my $max = length($item[3]) ? $item[3] : 'inf';
$return = undef;
if ($min == 0) {
if ($max eq 'inf') { $return = "(s?)" }
elsif ($max == 1) { $return = "(?)" }
else { $return = "(0..$max)" }
}
elsif ($min == 1) {
if ($max eq 'inf') { $return = "(s)" }
elsif ($max == 1) { $return = "" }
else { $return = "(..$max)" }
}
else {
if ($max eq 'inf') { $return = "($min..)" }
else { $return = "($min..$max)" }
}
}
_repeat_single : _DIGITS { "($item[1])" }
repeat : _repeat_full | _repeat_single
element : rulename | group | option
| char_val | num_val | prose_val
group : "(" _c_wsps_opt alternation _c_wsps_opt ")"
{ join "", @item[1..5] }
option : "[" _c_wsps_opt alternation _c_wsps_opt "]"
{ join "", "(", @item[2..4], ")(?)" }
char_val : DQUOTE /[\x20\x21\x23-\x7e]*/ DQUOTE
{ '"' . quotemeta($item[2]) . '"' }
num_val : "%" (bin_val | dec_val | hex_val)
_BITS : BIT(s) {
my $b = join "", @{ $item[1] };
$b = ('0' x (8 - length($b))) . $b;
$return = unpack 'H*', pack 'B*', $b;
}
_bin_val_concats : _BITS ("." _BITS)(s?) {
join "", '"', map({"\\x$_"} $item[1], @{$item[2]}), '"';
}
_bin_val_range : _BITS "-" _BITS {
"/[\\x$item[1]-\\x$item[3]]/"
}
bin_val : "b" (_bin_val_range | _bin_val_concats)
_DIGITS : DIGIT(s) { join "", @{$item[1]} }
_DIGITS_x : DIGIT(s) {
my $d = join "", @{$item[1]};
$return = unpack 'H*', pack 'C', $d;
}
_dec_val_concats : _DIGITS_x ("." _DIGITS_x)(s?) {
join "", '"', map({ "\\" . "x$_" } $item[1], @{$item[2]}), '"';
}
_dec_val_range : _DIGITS_x "-" _DIGITS_x {
"/[\\x$item[1]-\\x$item[3]]/"
}
dec_val : "d" (_dec_val_range | _dec_val_concats )
_HEXDIGS : HEXDIG(s) { join "", @{ $item[1] }}
_hex_val_concats : _HEXDIGS ("." _HEXDIGS)(s?) {
join "", '"', map({ "\\x$_" } $item[1], @{$item[2]}), '"';
}
_hex_val_range : _HEXDIGS "-" _HEXDIGS {
"/[\\x$item[1]-\\x$item[3]]/"
}
hex_val : "x" (_hex_val_range | _hex_val_concats )
prose_val : "<" /[\x20-\x3d\x3f-\x73]*/ ">"
and, when saved as 'abnf-direct.prd', it seems to do its work together with the companion script:
#!/usr/bin/env perl
use strict;
use warnings;
use Parse::RecDescent;
use File::Slurp qw( slurp );
$Parse::RecDescent::skip = '';
my $abnf = slurp 'abnf-direct.prd';
my $parser = Parse::RecDescent->new($abnf);
(my $grammar = slurp(@ARGV ? shift : \*STDIN)) =~ s/(?<!\r)\n/\r\n/g;
my $parsed = $parser->rulelist(\$grammar);
print {*STDERR} "parse ok\n" if defined $parsed;
print {*STDERR} "parsed it all\n" unless length $grammar;
print {*STDOUT} $parsed;
The script takes a ABNF grammar description in input and produces a grammar that should be suitable for
. At this point, one should work on this generated grammar to add methods that build up the data structure.
Is this a total waste of time? Is there something that I should read about before going this way?