PerlMonks  

RFC: A walkthrough from JSON ABNF to Regexp::Grammars

by Anonymous Monk
on Apr 04, 2012 at 02:25 UTC ( id://963349 )

Say you have an ABNF grammar (like JSON's) and you want to write a pure-Perl parser. Well, you grab Convert Augmented BNF (ABNF) to Regexp::Grammars and you're on your way (token: token: token:)

step 0

Save this program as reg, run perl reg, and examine the output.

A familiarity with perlintro/perlrequick/perldsc is useful.

#!/usr/bin/perl --
use strict;
use warnings;
use Data::Dump;

my $jp = do {
    use Regexp::Grammars;
    qr{
        # Keep the big stick handy, just in case...
        #~ <debug:on>

        # Match this...
        ^ <object> | <array> $

        <token: JSON_text>
            <object> | <array>

        <token: begin_array>
            (?: <ws>\x5B<ws> )

        <token: begin_object>
            (?: <ws>\x7B<ws> )

        <token: end_array>
            (?: <ws>\x5D<ws> )

        <token: end_object>
            (?: <ws>\x7D<ws> )

        <token: name_separator>
            (?: <ws>\x3A<ws> )

        <token: value_separator>
            (?: <ws>\x2C<ws> )

        <token: ws>
            (?: \x20 | \x09 | \x0A | \x0D )*

        <token: value>
            <false> | <null> | <true> | <object> | <array> | <number> | <string>

        <token: false>
            \x66\x61\x6c\x73\x65

        <token: null>
            \x6e\x75\x6c\x6c

        <token: true>
            \x74\x72\x75\x65

        <token: object>
            (?: <begin_object> (?: (?: <member> (?: (?: <value_separator><member> ) )* ) )? <end_object> )

        <token: member>
            (?: <string><name_separator><value> )

        <token: array>
            (?: <begin_array> (?: (?: <value> (?: (?: <value_separator><value> ) )* ) )? <end_array> )

        <token: number>
            (?: (?: <minus> )? <int> (?: <frac> )? (?: <exp> )? )

        <token: decimal_point>
            \x2E

        <token: digit1_9>
            [\x{31}-\x{39}]

        <token: e>
            \x65 | \x45

        <token: exp>
            (?: <e> (?: <minus> | <plus> )? (?: <DIGIT> )+ )

        <token: frac>
            (?: <decimal_point> (?: <DIGIT> )+ )

        <token: int>
            <zero> | (?: <digit1_9> (?: <DIGIT> )* )

        <token: minus>
            \x2D

        <token: plus>
            \x2B

        <token: zero>
            \x30

        <token: string>
            (?: <quotation_mark> (?: <char> )* <quotation_mark> )

        <token: char>
            <unescaped> | (?: <escape>\x22 | \x5C | \x2F | \x62 | \x66 | \x6E | \x72 | \x74 | (?: \x75 (?: <HEXDIG> ){4} ) )

        <token: escape>
            \x5C

        <token: quotation_mark>
            \x22

        <token: unescaped>
            [\x{20}-\x{21}] | [\x{23}-\x{5B}] | [\x{5D}-\x{10FFFF}]

        <token: HEXDIG>
            <DIGIT> | A | B | C | D | E | F

        <token: DIGIT>
            [\x{30}-\x{39}]
    }ixs
};

for my $str ( q/[false,true,null]/ , ){
    print "#<<@{[time-$^T]}# $str \n";
    if( $str =~ $jp ){
        dd(\%/) ; #/
    } else {
        print "## fail to match \n";
    }
    print "#>>@{[time-$^T]}#\n";
}

step 1

out:

#<<1# [false,true,null] 
{
  "" => "[false,true,null]",
  "array" => {
    "" => "[false,true,null]",
    "begin_array" => { "" => "[", "ws" => "" },
    "end_array" => { "" => "]", "ws" => "" },
    "value" => { "" => "null", "null" => "null" },
    "value_separator" => { "" => ",", "ws" => "" },
  },
}
#>>1#

Do you notice "value_separator" in the output? The first and easiest thing to do is get rid of the stuff we don't want to keep, which is all the stuff that isn't data, all the punctuation:

    value_separator
    begin_object
    end_object
    name_separator
    begin_array
    end_array
    quotation_mark

Changing <yo> to <.yo> is akin to changing ($yo) to (?:$yo): it means match, but do not save/capture.
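A core-Perl sketch of the same distinction (no Regexp::Grammars required):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A capturing group saves what it matched; (?: ) matches without saving.
"abc" =~ /(a)(?:b)(c)/ or die "no match";
print "$1$2\n";   # prints "ac" -- the "b" matched but was not captured

# <.yo> in Regexp::Grammars is the same idea: the yo subrule still has
# to match, but no "yo" key is stored in the result hash %/.
```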

Save the following diff as PATCHFILE, then use patch -i PATCHFILE to edit reg, and then run perl reg again and examine the output. Lines starting with a single - are removed, lines starting with a single + are added, and the other lines are context.

diff:

--- reg
+++ reg
@@ -49,3 +49,3 @@
 <token: object>
-    (?: <begin_object> (?: (?: <member> (?: (?: <value_separator><member> ) )* ) )? <end_object> )
+    (?: <.begin_object> (?: (?: <member> (?: (?: <.value_separator><member> ) )* ) )? <.end_object> )
@@ -55,3 +55,3 @@
 <token: array>
-    (?: <begin_array> (?: (?: <value> (?: (?: <value_separator><value> ) )* ) )? <end_array> )
+    (?: <.begin_array> (?: (?: <value> (?: (?: <.value_separator><value> ) )* ) )? <.end_array> )
@@ -88,3 +88,3 @@
 <token: string>
-    (?: <quotation_mark> (?: <char> )* <quotation_mark> )
+    (?: <.quotation_mark> (?: <char> )* <.quotation_mark> )

step 2

out:

#<<0# [false,true,null] 
{
  "" => "[false,true,null]",
  "array" => {
    "" => "[false,true,null]",
    "value" => { "" => "null", "null" => "null" },
  },
}
#>>0#

Notice in token: array how <value> is repeated, but the Dumper output shows only a single hashref for value, the last one matched, null?

If we want to capture all the values, not just the last one, we have to tell Regexp::Grammars to save an array with <[value]>
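A minimal sketch of the difference, using a hypothetical list/item grammar (the rule names are made up for illustration, not part of the JSON grammar):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dump;

my $csv = do {
    use Regexp::Grammars;
    qr{
        <nocontext: >
        ^ <list> $
        # <[item]> pushes every repetition onto an arrayref,
        # where a plain <item> would keep only the last match
        <token: list> <[item]> (?: , <[item]> )*
        <token: item> \w+
    }x;
};
if( 'a,b,c' =~ $csv ){
    dd( \%/ );   # { list => { item => ["a", "b", "c"] } }
}
```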

diff:

--- reg
+++ reg
@@ -55,3 +55,3 @@
 <token: array>
-    (?: <.begin_array> (?: (?: <value> (?: (?: <.value_separator><value> ) )* ) )? <.end_array> )
+    (?: <.begin_array> (?: (?: <[value]> (?: (?: <.value_separator><[value]> ) )* ) )? <.end_array> )

step 3

out:

#<<0# [false,true,null] 
{
  "" => "[false,true,null]",
  "array" => {
    "" => "[false,true,null]",
    "value" => [
      { "" => "false", "false" => "false" },
      { "" => "true", "true" => "true" },
      { "" => "null", "null" => "null" },
    ],
  },
}
#>>0#

Whoa, adding <[value]> sure increased the number of keys in the hash %/.

Now the first key, the empty-string key holding the "context" substring, starts to clutter the Dumper output.

To turn this off for all rules, add <nocontext: > at the top of the grammar. This makes the Dumper output much neater.

diff:

--- reg
+++ reg
@@ -7,2 +7,3 @@
 #~ <debug:on>
+<nocontext: > # Switch off context substring retention

step 4

out:

#<<0# [false,true,null] 
{
  array => {
    value => [{ false => "false" }, { true => "true" }, { null => "null" }],
  },
}
#>>0#

Now that is much less cluttered. Next, to turn those true/false/null hashrefs into objects (and undef), we add a code block (?{ }) to each respective rule. A rule is a named subpattern with magic whitespace; a token is a named subpattern without magic whitespace.
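A tiny sketch of a (?{ }) block overriding a rule's result via $MATCH (the answer token here is hypothetical, not part of the JSON grammar):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $bool = do {
    use Regexp::Grammars;
    qr{
        <nocontext: >
        ^ <answer> $
        # the code block runs after its alternative matches and
        # replaces the rule's result with whatever $MATCH holds
        <token: answer>
            true  (?{ $MATCH = 1 })
          | false (?{ $MATCH = 0 })
    }x;
};
'false' =~ $bool and print "answer=$/{answer}\n";   # answer=0
```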

diff:

--- reg
+++ reg
@@ -40,12 +40,15 @@
 <token: false>
     \x66\x61\x6c\x73\x65
+    (?{ $MATCH = ::FALSE(); })
 
 <token: null>
     \x6e\x75\x6c\x6c
+    (?{ $MATCH = undef; })
 
 <token: true>
     \x74\x72\x75\x65
+    (?{ $MATCH = ::TRUE(); })
 
 <token: object>
     (?: <.begin_object> (?: (?: <member> (?: (?: <.value_separator><member> ) )* ) )? <.end_object> )
@@ -117,3 +120,6 @@
 }
 print "#>>@{[time-$^T]}#\n";
 }
+sub TRUE  { bless \(my$s=!!1), 'TRUE' }
+sub FALSE { bless \(my$s=!!0), 'FALSE' }
+

step 5

out:

#<<1# [false,true,null] 
{
  array => {
    value => [
      { false => bless(do{\(my $o = "")}, "FALSE") },
      { true => bless(do{\(my $o = 1)}, "TRUE") },
      { null => undef },
    ],
  },
}
#>>1#

Whoa, the hashrefs didn't go away, but we do get objects (good enough for now).

To get rid of those hashrefs we have to edit token: value to add the MATCH= alias, so instead of a hashref with keys/values, we get just the values. We'll do this for all rules being called/matched in token: value.
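A small sketch of the MATCH= alias in isolation (value/number/word are illustrative rule names, not the JSON grammar's):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dump;

my $v = do {
    use Regexp::Grammars;
    qr{
        <nocontext: >
        ^ <value> $
        # <MATCH=number> means: value's result IS number's result,
        # not a one-key hashref { number => ... }
        <token: value>  <MATCH=number> | <MATCH=word>
        <token: number> \d+
        <token: word>   [a-z]+
    }x;
};
'42' =~ $v and dd( \%/ );   # { value => 42 }
```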

diff:

--- reg
+++ reg
@@ -38,3 +38,3 @@
 <token: value>
-    <false> | <null> | <true> | <object> | <array> | <number> | <string>
+    <MATCH=false> | <MATCH=null> | <MATCH=true> | <MATCH=object> | <MATCH=array> | <MATCH=number> | <MATCH=string>

step 6

out:

#<<0# [false,true,null] 
{
  array => {
    value => [
      bless(do{\(my $o = "")}, "FALSE"),
      bless(do{\(my $o = 1)}, "TRUE"),
      undef,
    ],
  },
}
#>>0#

Much better. Now to collapse the hashref for token: array, we edit token: array to add a code block. This is called manual result distillation.

diff:

--- reg
+++ reg
@@ -58,6 +58,9 @@
 <token: array>
     (?: <.begin_array> (?: (?: <[value]> (?: (?: <.value_separator><[value]> ) )* ) )? <.end_array> )
+    (?{
+        $MATCH = $MATCH{value} ;
+    })
 
 <token: number>
     (?: (?: <minus> )? <int> (?: <frac> )? (?: <exp> )? )

step 7

out:

#<<1# [false,true,null] 
{
  array => [
    bless(do{\(my $o = "")}, "FALSE"),
    bless(do{\(my $o = 1)}, "TRUE"),
    undef,
  ],
}
#>>1#

That looks spot on. Now to put some numbers in an array. When dealing with new input, we temporarily turn context back on to get our bearings.

diff:

--- reg
+++ reg
@@ -8 +8 @@
-<nocontext: > # Switch off context substring retention
+#~ <nocontext: > # Switch off context substring retention
@@ -117 +117 @@
-for my $str( q/[false,true,null]/ , ){
+for my $str( q/[ -2.0, 4.333e333, 600 ]/ , ){

step 8

out:

#<<0# [ -2.0, 4.333e333, 600 ] 
{
  "" => "[ -2.0, 4.333e333, 600 ]",
  "array" => [
    {
      "" => "-2.0",
      "frac" => { "" => ".0", "decimal_point" => ".", "DIGIT" => 0 },
      "int" => { "" => 2, "digit1_9" => 2 },
      "minus" => "-",
    },
    {
      "" => "4.333e333",
      "exp" => { "" => "e333", "DIGIT" => 3, "e" => "e" },
      "frac" => { "" => ".333", "decimal_point" => ".", "DIGIT" => 3 },
      "int" => { "" => 4, "digit1_9" => 4 },
    },
    { "" => 600, "int" => { "" => 600, "DIGIT" => 0, "digit1_9" => 6 } },
  ],
}
#>>0#

Whoa, that looks hairy. We're missing some digits and we got a bunch of hashrefs.

So to make sure no digits go missing, we make sure the repeated rules (those with quantifiers that can match more than once, quantifiers like * + and {n}) are saved to an array. This means changing <subrule> to <[subrule]>.

diff:

--- reg
+++ reg
@@ -7,3 +7,3 @@
 #~ <debug:on>
-#~ <nocontext: > # Switch off context substring retention
+<nocontext: > # Switch off context substring retention
@@ -77,9 +77,9 @@
 <token: exp>
-    (?: <e> (?: <minus> | <plus> )? (?: <DIGIT> )+ )
+    (?: <e> (?: <minus> | <plus> )? (?: <[DIGIT]> )+ )
 
 <token: frac>
-    (?: <decimal_point> (?: <DIGIT> )+ )
+    (?: <decimal_point> (?: <[DIGIT]> )+ )
 
 <token: int>
-    <zero> | (?: <digit1_9> (?: <DIGIT> )* )
+    <zero> | (?: <digit1_9> (?: <[DIGIT]> )* )

step 9

out:

#<<1# [ -2.0, 4.333e333, 600 ] 
{
  array => [
    {
      frac => { decimal_point => ".", DIGIT => [0] },
      int => { digit1_9 => 2 },
      minus => "-",
    },
    {
      exp => { DIGIT => [3, 3, 3], e => "e" },
      frac => { decimal_point => ".", DIGIT => [3, 3, 3] },
      int => { digit1_9 => 4 },
    },
    { int => { DIGIT => [0, 0], digit1_9 => 6 } },
  ],
}
#>>1#

Now that we're not missing digits anymore, we collapse the unwanted repeaters and hash keys. The repeaters are DIGIT (from rules exp/frac/int); the unwanted keys are e/minus/plus (from rule exp), decimal_point (from rule frac), and zero/digit1_9 (from rule int).
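The collapsing itself is plain Perl; a sketch with a hand-built %MATCH standing in for what Regexp::Grammars provides inside the code block:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# roughly what $MATCH{...} might hold inside token: exp after matching "e333"
my %MATCH = ( e => 'e', minus => undef, plus => undef, DIGIT => [3, 3, 3] );

# grep defined drops the subrules that did not match (minus/plus here),
# and dereferencing the DIGIT arrayref flattens it back into its digits
my $exp = join '', grep defined, $MATCH{e}, $MATCH{minus}, $MATCH{plus}, @{ $MATCH{DIGIT} };
print "$exp\n";   # e333
```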

diff:

--- reg
+++ reg
@@ -76,12 +76,21 @@
 <token: exp>
     (?: <e> (?: <minus> | <plus> )? (?: <[DIGIT]> )+ )
+    (?{
+        $MATCH = join '', grep defined, $MATCH{e}, $MATCH{minus}, $MATCH{plus}, @{$MATCH{DIGIT}};
+    })
 
 <token: frac>
     (?: <decimal_point> (?: <[DIGIT]> )+ )
+    (?{
+        $MATCH = join '', grep defined, $MATCH{decimal_point}, @{$MATCH{DIGIT}};
+    })
 
 <token: int>
-    <zero> | (?: <digit1_9> (?: <[DIGIT]> )* )
+    <MATCH=zero> | (?: <digit1_9> (?: <[DIGIT]> )* )
+    (?{
+        $MATCH = join '', grep defined, $MATCH{digit1_9}, @{$MATCH{DIGIT}};
+    })
 
 <token: minus>
     \x2D
@@ -114,7 +123,7 @@
     [\x{30}-\x{39}]
 }ixs};
-for my $str( q/[ -2.0, 4.333e333, 600 ]/ , ){
+for my $str( q/[ -2.0, 4.333e333, 600, 0, 9 ]/ , ){
     print "#<<@{[time-$^T]}# $str \n";
     if( $str =~ $jp ){
         dd(\%/) ; #/

step 10

out:

#<<0# [ -2.0, 4.333e333, 600, 0, 9 ] 
{
  array => [
    { frac => ".0", int => 2, minus => "-" },
    { exp => "e333", frac => ".333", int => 4 },
    { int => 600 },
    { int => 0 },
    { int => 9 },
  ],
}
#>>0#

And we have to edit token: number just like token: array to get rid of those hashrefs.

diff:

--- reg
+++ reg
@@ -65,4 +65,7 @@
 <token: number>
     (?: (?: <minus> )? <int> (?: <frac> )? (?: <exp> )? )
+    (?{
+        $MATCH = join'', grep defined, map{$MATCH{$_}} qw{ minus int frac exp };
+    })
 
 <token: decimal_point>

step 11

out:

#<<1# [ -2.0, 4.333e333, 600, 0, 9 ] 
{ array => ["-2.0", "4.333e333", 600, 0, 9] }
#>>1#

So far so good. Next we'll tackle an array of strings.

diff:

--- reg
+++ reg
@@ -8 +8 @@
-<nocontext: > # Switch off context substring retention
+#~ <nocontext: > # Switch off context substring retention
@@ -129 +129 @@
-for my $str( q/[ -2.0, 4.333e333, 600, 0, 9 ]/ , ){
+for my $str( q{[ "\\" quote \\t tab" ]} , ){

step 12

out:

#<<0# [ "\" quote \t tab" ] 
{
  "" => "[ \"\\\" quote \\t tab\" ]",
  "array" => [
    {
      "" => "\"\\\" quote \\t tab\"",
      "char" => { "" => "b", "unescaped" => "b" },
    },
  ],
}
#>>0#

Notice we're now missing chars (the output shows only one), so we look for <char>, find it in token: string, and make sure it's saved to an array.

diff:

--- reg
+++ reg
@@ -7,3 +7,3 @@
 #~ <debug:on>
-#~ <nocontext: > # Switch off context substring retention
+<nocontext: > # Switch off context substring retention
@@ -107,3 +107,3 @@
 <token: string>
-    (?: <.quotation_mark> (?: <char> )* <.quotation_mark> )
+    (?: <.quotation_mark> (?: <[char]> )* <.quotation_mark> )

step 13

out:

#<<0# [ "\" quote \t tab" ] 
{
  array => [
    {
      char => [
        { escape => "\\" },
        { unescaped => " " },
        { unescaped => "q" },
        { unescaped => "u" },
        { unescaped => "o" },
        { unescaped => "t" },
        { unescaped => "e" },
        { unescaped => " " },
        "\\",
        { unescaped => "t" },
        { unescaped => " " },
        { unescaped => "t" },
        { unescaped => "a" },
        { unescaped => "b" },
      ],
    },
  ],
}
#>>0#

Whoa, that is almost all the chars, but where did the " go? And why is "\\" a plain string, not a reference?

The "\\" corresponds to the part of rule char (token: char) that is not a Regexp::Grammars named pattern (like <unescaped>) but plain regular-expression syntax (\x22 | \x5C | \x2F | \x62 | \x66 | \x6E | \x72 | \x74).

This is getting complicated: the thing that needs escaping comes back as a plain string, but the thing that doesn't comes back as a reference. This is a perfect time to break from the official ABNF. We redefine the token: char rule by adding a new escaped rule to complement unescaped, and collapse the unescaped reference with the MATCH= alias.

Instead of messing with MATCH= in the new token: escaped, we'll skip the <HEXDIG> rule.

We also add some unicode escapes to our input string because they need special handling.
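The pack/hex dance used below, in isolation: by the time token_string sees an escaped token, the leading backslash has already been eaten by <.escape>, so the token text starts with the "u". (The sample escape here is illustrative.)

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $escaped = 'u263A';                            # escaped-token text for \u263A
# skip the leading "u", take the four hex digits, and pack the code point
my $char = pack 'U', hex substr $escaped, 1, 4;
binmode STDOUT, ':encoding(UTF-8)';
print "$char\n";                                  # U+263A WHITE SMILING FACE
```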

diff:

--- reg
+++ reg
@@ -108,0 +109,3 @@
+    (?{
+        $MATCH = token_string( @{ $MATCH{char} } );
+    })
@@ -111 +114,4 @@
-    <unescaped> | (?: <escape>\x22 | \x5C | \x2F | \x62 | \x66 | \x6E | \x72 | \x74 | (?: \x75 (?: <HEXDIG> ){4} ) )
+    <MATCH=unescaped> | (?: <.escape> <escaped> )
+
+<token: escaped>
+    \x22 | \x5C | \x2F | \x62 | \x66 | \x6E | \x72 | \x74 | \x75[1-9A-F]{4}
@@ -129 +135 @@
-for my $str( q{[ "\\" quote \\t tab" ]} , ){
+for my $str( q{[ "\\" quote \\t tab \\uD83D\\uDC2A U+1F42A DROMEDARY CAMEL" ]}, ){
@@ -140,0 +147,23 @@
+BEGIN {
+    my %rep = (
+        "\"" => '"',
+        "/"  => '/',
+        "\\" => '\\',
+        b => "\b",
+        f => "\f",
+        n => "\n",
+        r => "\r",
+        t => "\t"
+    );
+    sub token_string {
+        return join '',
+            map {
+                my $ret = $_;
+                if( ref $_ ){
+                    $ret = $rep{ $_->{escaped} };
+                    $ret ||= pack('U', hex substr $_->{escaped}, 1, 4 );
+                }
+                $ret;
+            } @_;
+    }
+}

step 14

out:

#<<0# [ "\" quote \t tab \uD83D\uDC2A U+1F42A DROMEDARY CAMEL" ] 
{
  array => ["\" quote \t tab \x{D83D}\x{DC2A} U+1F42A DROMEDARY CAMEL"],
}
#>>0#

So far so good. We change input again to see how an array of arrays of strings/numbers parses.

diff:

--- reg
+++ reg
@@ -135 +135 @@
-for my $str( q{[ "\\" quote \\t tab \\uD83D\\uDC2A U+1F42A DROMEDARY CAMEL" ]}, ){
+for my $str( q{[ -90.0, [" \\t tab",[7,11]],3e9,"\\r return", null ]} , ){

step 15

out:

#<<1# [ -90.0, [" \t tab",[7,11]],3e9,"\r return", null ] 
{
  array => ["-90.0", [" \t tab", [7, 11]], "3e9", "\r return", undef],
}
#>>1#

Seems we're done with arrays. Next input is objects.

diff:

--- reg
+++ reg
@@ -8 +8 @@
-<nocontext: > # Switch off context substring retention
+#~ <nocontext: > # Switch off context substring retention
@@ -135 +135 @@
-for my $str( q{[ -90.0, [" \\t tab",[7,11]],3e9,"\\r return", null ]} , ){
+for my $str( q{{ "key" : "lime", "blue": "berry" }}, ){

step 16

out:

#<<0# { "key" : "lime", "blue": "berry" } 
{
  "" => "{ \"key\" : \"lime\", \"blue\": \"berry\" }",
  "object" => {
    "" => "{ \"key\" : \"lime\", \"blue\": \"berry\" }",
    "member" => {
      "" => "\"blue\": \"berry\"",
      "name_separator" => { "" => ": ", "ws" => " " },
      "string" => "blue",
      "value" => "berry",
    },
  },
}
#>>0#

We forgot to remove name_separator in token: object (and token: member). In token: object the members repeat, so we need to save them to an array. In token: member we collapse <string> and <value>.

diff:

--- reg
+++ reg
@@ -7,3 +7,3 @@
 #~ <debug:on>
-#~ <nocontext: > # Switch off context substring retention
+<nocontext: > # Switch off context substring retention
@@ -53,6 +53,9 @@
 <token: object>
-    (?: <.begin_object> (?: (?: <member> (?: (?: <.value_separator><member> ) )* ) )? <.end_object> )
+    (?: <.begin_object> (?: (?: <[member]> (?: (?: <.value_separator><[member]> ) )* ) )? <.end_object> )
 
 <token: member>
-    (?: <string><name_separator><value> )
+    (?: <string><.name_separator><value> )
+    (?{
+        $MATCH = [ $MATCH{string}, $MATCH{value} ];
+    })

step 17

out:

#<<0# { "key" : "lime", "blue": "berry" } 
{ object => { member => [["key", "lime"], ["blue", "berry"]] } }
#>>0#

Next step: collapse member and turn it from an array-of-arrays into a hash.
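The array-of-pairs to hash flattening, in isolation:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# what $MATCH{member} holds after the previous step
my $member = [ [ "key", "lime" ], [ "blue", "berry" ] ];

# map { @$_ } unrolls each two-element pair into a flat key/value list,
# which the hash constructor then pairs back up into keys and values
my %object = map { @$_ } @$member;
print "$object{blue}\n";   # berry
```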

diff:

--- reg
+++ reg
@@ -53,4 +53,7 @@
 <token: object>
     (?: <.begin_object> (?: (?: <[member]> (?: (?: <.value_separator><[member]> ) )* ) )? <.end_object> )
+    (?{
+        $MATCH = { map { @$_ } @{$MATCH{member}} };
+    })
 
 <token: member>

step 18

out:

#<<1# { "key" : "lime", "blue": "berry" } 
{ object => { blue => "berry", key => "lime" } }
#>>1#

Next input is an array of objects, and an object of arrays.

diff:

--- reg
+++ reg
@@ -141 +141 @@
-for my $str( q{{ "key" : "lime", "blue": "berry" }}, ){
+for my $str( q{[{"k":"v"},{"v":"k"}] }, q{{"ro":["sham","bo"],"t":{"i":{"c":{"t":{"o":"c"}}}}}}){

step 19

out:

#<<1# [{"k":"v"},{"v":"k"}] 
{ array => [{ k => "v" }, { v => "k" }] }
#>>1#
#<<1# {"ro":["sham","bo"],"t":{"i":{"c":{"t":{"o":"c"}}}}} 
{
  object => {
    ro => ["sham", "bo"],
    t => { i => { c => { t => { o => "c" } } } },
  },
}
#>>1#

Is that it? What else is there to do?

    test, test, test
    when possible, reduce the number of rules and simplify them; perl is not BNF, and fewer rules run faster
    minimize backtracking (not sure how)
    make TRUE/FALSE Readonly references, maybe constant::boolean
    make the grammar fail, and fail early, on invalid input (not sure how)
    learn something from Parse::RecDescent::FAQ / Parse::RecDescent::FAQ::Original
    use Marpa instead?? (also not sure how)

Also, the erroneous input fail5.json takes a long, long time to fail (~0 versus ~40 seconds), and [] input becomes undef instead of an empty array?

diff:

--- reg
+++ reg
@@ -141 +141 @@
-for my $str( q{[{"k":"v"},{"v":"k"}] }, q{{"ro":["sham","bo"],"t":{"i":{"c":{"t":{"o":"c"}}}}}}){
+for my $str( q{[1,[2,[3],[]]]}, q{["double extra comma",,]} , ){

step 20

out:

#<<0# [1,[2,[3],[]]] 
{ array => [1, [2, [3], undef]] }
#>>0#
#<<0# ["double extra comma",,] 
## fail to match 
#>>43#

That took 40 seconds to fail.

Can't run this under use re 'debug'; I get a segfault.

Both inputs fail to match if you turn on <debug:on>; I don't understand <debug:on>.

Hopefully you've learned something about ABNF and Regexp::Grammars from this exercise; I did :)


Re: RFC: A walkthrough from JSON ABNF to Regexp::Grammars
by Anonymous Monk on Oct 24, 2012 at 10:40 UTC
    Naturally it doesn't handle escaped characters outside the Basic Multilingual Plane
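Indeed: characters outside the BMP arrive as two \uXXXX escapes (a UTF-16 surrogate pair), and the token_string above packs each half separately. A sketch of what full support would need, combining a pair like \uD83D\uDC2A into one code point (the formula is the standard UTF-16 decoding):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# high surrogate D800-DBFF carries the top 10 bits of (cp - 0x10000),
# low surrogate DC00-DFFF carries the bottom 10 bits
my ( $hi, $lo ) = ( 0xD83D, 0xDC2A );
my $cp = 0x10000 + ( ( $hi - 0xD800 ) << 10 ) + ( $lo - 0xDC00 );
printf "U+%X\n", $cp;   # U+1F42A -- DROMEDARY CAMEL
```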

Node Type: perlmeditation [id://963349]
Approved by ww
Front-paged by ww