Hi Monks
Edit:I have improved the ugliness of the code by indenting the output. Behavior is same, except for cases like say '12,34' where i got warnings like 'Negative repeat count does nothing at sscce_rxjs.pl line 43.' showing we seem reenter some of 'value' tests, but without passing via start of the rule.../EDIT
I am working on an evolution of the that astonishing regexp published here to parse JSON in a self-contained recursive descendent extended regexp.
I have integrated it in tool script that is working pretty well, while I only encountered some catastrophic memory consumption issues on some big JSON files due to the parameter passing between rules by stacking in $^R. And like the original code stated, my need is to strive to have the minimum if no external dependency but standard distribution packages. Performance is not an issue, while I will try to improve memory usage on big data structures.
I have reworked the code a bit, commented as possible to document working (it took me a long time to grasp the working of that extended regexp, especially usage of $^R, $^N and code rules). I have also added a lot of string to trace the execution of the regex.
Well formed JSON is parsed successfully, while I do not understand the traces of non well-formed data, and I noticed a worst case that segfaults on OSX (10.13 / 5.22.0 & 5.28.0) while it is working well on Linux with same 5.22.0. plenv is used in both cases.
A correct run:
$ perl sscce_rxjs.pl '{"id":42}'
value?
try object
tuple rule
string rule
->have string 'id'
value?
try object
try number
number rule
->have number 42
post number
end of value
->have tuple
first object pair $VAR1 = [
undef,
{}
,
'id',
42
];
end of value
$VAR1 = {
'id' => 42
};$ perl sscce_rxjs.pl '12,34' value? try object try number number rule ->have number 12 post number end of value ->have number 1 post number end of value try string <--- EDIT: could it be a behavior of 'value' rule 'or' alternatives being not short-circuited? string rule try array try true try false try null $VAR1 = undef;
$ perl sscce_rxjs.pl '"foo",' value? try object try number number rule try string string rule ->have string 'foo' end of value ->have string 'foo' end of value ->have string 'foo' end of value ->have string 'foo' end of value try array try true try false try null $VAR1 = undef;
Would you have an explanation, or did I missed some error just not obvious for me?
use strict; use warnings; use feature 'say'; use Data::Dumper; $Data::Dumper::Sortkeys = 1; $Data::Dumper::Indent = 2; my $data = $ARGV[0] || '{"id":42}'; my $o = from_json($data); say Dumper $o; sub TRACE_JSON {1} # Return a Perl structure corresponding to a json string sub from_json { my $rx = qr{ # NOTES: # this regex is a recusrive descent parser - see https://www.perlm +onks.org/?node_id=995856 # and chapter 1 "Recursive regular expressions" of Mastering Perl +(Brian d Foy) # # Inside the block (?(DEFINE) ...) (?<FOOBAR> ...) defines a name +d pattern FOOBAR # that can be called with (?&FOO +BAR) # (?{ ... }) is a block of Perl code that is evaluated at the time + we reach it while running the pattern # $^R is the value returned by the last runned (?{ }) block # $^N is the last matched group (?&VALUE) (?{ $_ = $^R->[1] }) # <== entry point of the parser (?(DEFINE) # this does not try to match, it only defines a serie o +f named patterns (?<VALUE> (?{ say 'value?' if TRACE_JSON;$^R }) \s* ( (?{ say 'try object' if TRACE_JSON;$^R }) (?&OBJECT) | (?{ say 'try number' if TRACE_JSON;$^R }) (?&NUMBER) (?{ say ' +post number' if TRACE_JSON;$^R }) | (?{ say 'try string' if TRACE_JSON;$^R }) (?&STRING) | (?{ say 'try array' if TRACE_JSON;$^R }) (?&ARRAY) | (?{ say 'try true' if TRACE_JSON;$^R }) true (?{ say '->true +' if TRACE_JSON; [$^R, 1] }) | (?{ say 'try false' if TRACE_JSON;$^R }) false (?{ say '->fal +se' if TRACE_JSON; [$^R, 0] }) | (?{ say 'try null' if TRACE_JSON;$^R }) null (?{ say '->null +' if TRACE_JSON; [$^R, undef] }) ) \s* (?{ say 'end of value' if TRACE_JSON;$^R }) ) (?<OBJECT> # will generate a Perl hash (?{ [$^R, {}] }) # init structure \{ # start of object \s* (?: (?&KV) # [[$^R, {}], $k, $v] # first pair (?{ say('first object pair ', Dumper($^R)) if TRACE_JSON; +[$^R->[0][0], {$^R->[1] => $^R->[2]}] }) (?: # additional pairs \s* , \s* (?&KV) # [[$^R, {...}], $k, $v] (?{ say('additional object pair ', Dumper($^R)) if TRACE +_JSON; [$^R->[0][0], {%{ $^R->[0][1]}, $^R->[1] => $^R->[2]}] }) )* # additional pairs are optional )? # object may be empty \} # end of object ) (?<KV> # tuple <key, value> (?{ say 'tuple rule' if TRACE_JSON;$^R }) (?&STRING) # [$^R, "string"] \s* : \s* (?&VALUE) # [[$^R, "string"], $value] (?{ say '->have tuple' if TRACE_JSON; [$^R->[0][0], $^R->[0][1 +], $^R->[1]] }) ) (?<ARRAY> # will generate a Perl array (?{ [$^R, []] }) # init structure \[ # start of array (?: (?&VALUE) # first element (?{ say('first array item ', Dumper($^R)) if TRACE_JSON; [ +$^R->[0][0], [$^R->[1]]] }) (?: # additional elements \s* , \s* (?&VALUE) # additional elements (?{ say('additional array item ', Dumper($^R)) if TRACE_ +JSON; [$^R->[0][0], [@{$^R->[0][1]}, $^R->[1]]] }) )* # additional elements are optional )? # array may be empty \] # end of array ) (?<STRING> (?{ say 'string rule' if TRACE_JSON;$^R }) ( " (?: [^\\"]+ | \\ ["\\bfnrt] # escaped backspace, form feed, newline, ca +rriage return, tab, \, " | \\ u [0-9a-fA-F]{4} )* " ) (?{ my $s = $^N; $s =~ s/\\u([0-9A-Fa-f]{4})/\\x{$1}/g; $s =~ s/@/\\@/g; my $v = eval $s; say "->have string '$v'" if TRACE_JSON; [ $^R, $v ] }) ) (?<NUMBER> (?{ say 'number rule' if TRACE_JSON;$^R }) ( -? (?: 0 | [1-9]\d* ) (?: \. \d+ )? (?: [eE] [-+]? \d+ )? ) (?{ my $v = eval $^N; say "->have number $v" if TRACE_JSON; [$^R, $v] }) ) ) #DEFINE }xms; my $struct; { local $_ = shift; local $^R; eval { m{\A$rx\z}; } and $struct = $_; } return $struct; }
In reply to strange behavior of JSON parsing guru regex by seki
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |