Hi Monks

Edit:I have improved the ugliness of the code by indenting the output. Behavior is same, except for cases like say '12,34' where i got warnings like 'Negative repeat count does nothing at sscce_rxjs.pl line 43.' showing we seem reenter some of 'value' tests, but without passing via start of the rule.../EDIT

I am working on an evolution of the that astonishing regexp published here to parse JSON in a self-contained recursive descendent extended regexp.

I have integrated it in tool script that is working pretty well, while I only encountered some catastrophic memory consumption issues on some big JSON files due to the parameter passing between rules by stacking in $^R. And like the original code stated, my need is to strive to have the minimum if no external dependency but standard distribution packages. Performance is not an issue, while I will try to improve memory usage on big data structures.

I have reworked the code a bit, commented as possible to document working (it took me a long time to grasp the working of that extended regexp, especially usage of $^R, $^N and code rules). I have also added a lot of string to trace the execution of the regex.
Well formed JSON is parsed successfully, while I do not understand the traces of non well-formed data, and I noticed a worst case that segfaults on OSX (10.13 / 5.22.0 & 5.28.0) while it is working well on Linux with same 5.22.0. plenv is used in both cases.

A correct run:

$ perl sscce_rxjs.pl '{"id":42}'
value?
try object
tuple rule
string rule
->have string 'id'
value?
try object
try number
number rule
->have number 42
post number
end of value
->have tuple
first object pair $VAR1 = [
          
            undef,
            {}
          ,
          'id',
          42
        ];

end of value
$VAR1 = {
          'id' => 42
        };

Now, with an input data of "12,34", while the undef returned is the expected value, I do not understand the origin of all lines following the first 'end of value', while we do not see that a new "value" rule is entered by telling 'value?' we see an additional 'end of value' ans several 'try xx' that are alternatives in the 'value' rule:
$ perl sscce_rxjs.pl '12,34'
value?
try object
try number
number rule
->have number 12
post number
end of value
->have number 1
post number
end of value
try string       <--- EDIT: could it be a behavior of 'value' rule 'or' alternatives being not short-circuited? 
string rule
try array
try true
try false
try null
$VAR1 = undef;

Now with an incorrect input as simple as '"foo",' i am entering the twilight zone with redundant output, and I noticed the behavior is somewhat random. Note that in the production code, the execution segfaults on my mac with a little more complex data like the one available from this test data. I wonder if it could be a side-effect of the eval() to compute the value returned by 'string' rule?
$ perl sscce_rxjs.pl '"foo",'
value?
try object
try number
number rule
try string
string rule
->have string 'foo'
end of value
->have string 'foo'
end of value
->have string 'foo'
end of value
->have string 'foo'
end of value
try array
try true
try false
try null
$VAR1 = undef;

Would you have an explanation, or did I missed some error just not obvious for me?

use strict; use warnings; use feature 'say'; use Data::Dumper; $Data::Dumper::Sortkeys = 1; $Data::Dumper::Indent = 2; my $data = $ARGV[0] || '{"id":42}'; my $o = from_json($data); say Dumper $o; sub TRACE_JSON {1} # Return a Perl structure corresponding to a json string sub from_json { my $rx = qr{ # NOTES: # this regex is a recusrive descent parser - see https://www.perlm +onks.org/?node_id=995856 # and chapter 1 "Recursive regular expressions" of Mastering Perl +(Brian d Foy) # # Inside the block (?(DEFINE) ...) (?<FOOBAR> ...) defines a name +d pattern FOOBAR # that can be called with (?&FOO +BAR) # (?{ ... }) is a block of Perl code that is evaluated at the time + we reach it while running the pattern # $^R is the value returned by the last runned (?{ }) block # $^N is the last matched group (?&VALUE) (?{ $_ = $^R->[1] }) # <== entry point of the parser (?(DEFINE) # this does not try to match, it only defines a serie o +f named patterns (?<VALUE> (?{ say 'value?' if TRACE_JSON;$^R }) \s* ( (?{ say 'try object' if TRACE_JSON;$^R }) (?&OBJECT) | (?{ say 'try number' if TRACE_JSON;$^R }) (?&NUMBER) (?{ say ' +post number' if TRACE_JSON;$^R }) | (?{ say 'try string' if TRACE_JSON;$^R }) (?&STRING) | (?{ say 'try array' if TRACE_JSON;$^R }) (?&ARRAY) | (?{ say 'try true' if TRACE_JSON;$^R }) true (?{ say '->true +' if TRACE_JSON; [$^R, 1] }) | (?{ say 'try false' if TRACE_JSON;$^R }) false (?{ say '->fal +se' if TRACE_JSON; [$^R, 0] }) | (?{ say 'try null' if TRACE_JSON;$^R }) null (?{ say '->null +' if TRACE_JSON; [$^R, undef] }) ) \s* (?{ say 'end of value' if TRACE_JSON;$^R }) ) (?<OBJECT> # will generate a Perl hash (?{ [$^R, {}] }) # init structure \{ # start of object \s* (?: (?&KV) # [[$^R, {}], $k, $v] # first pair (?{ say('first object pair ', Dumper($^R)) if TRACE_JSON; +[$^R->[0][0], {$^R->[1] => $^R->[2]}] }) (?: # additional pairs \s* , \s* (?&KV) # [[$^R, {...}], $k, $v] (?{ say('additional object pair ', Dumper($^R)) if TRACE +_JSON; [$^R->[0][0], {%{ $^R->[0][1]}, $^R->[1] => $^R->[2]}] }) )* # additional pairs are optional )? # object may be empty \} # end of object ) (?<KV> # tuple <key, value> (?{ say 'tuple rule' if TRACE_JSON;$^R }) (?&STRING) # [$^R, "string"] \s* : \s* (?&VALUE) # [[$^R, "string"], $value] (?{ say '->have tuple' if TRACE_JSON; [$^R->[0][0], $^R->[0][1 +], $^R->[1]] }) ) (?<ARRAY> # will generate a Perl array (?{ [$^R, []] }) # init structure \[ # start of array (?: (?&VALUE) # first element (?{ say('first array item ', Dumper($^R)) if TRACE_JSON; [ +$^R->[0][0], [$^R->[1]]] }) (?: # additional elements \s* , \s* (?&VALUE) # additional elements (?{ say('additional array item ', Dumper($^R)) if TRACE_ +JSON; [$^R->[0][0], [@{$^R->[0][1]}, $^R->[1]]] }) )* # additional elements are optional )? # array may be empty \] # end of array ) (?<STRING> (?{ say 'string rule' if TRACE_JSON;$^R }) ( " (?: [^\\"]+ | \\ ["\\bfnrt] # escaped backspace, form feed, newline, ca +rriage return, tab, \, " | \\ u [0-9a-fA-F]{4} )* " ) (?{ my $s = $^N; $s =~ s/\\u([0-9A-Fa-f]{4})/\\x{$1}/g; $s =~ s/@/\\@/g; my $v = eval $s; say "->have string '$v'" if TRACE_JSON; [ $^R, $v ] }) ) (?<NUMBER> (?{ say 'number rule' if TRACE_JSON;$^R }) ( -? (?: 0 | [1-9]\d* ) (?: \. \d+ )? (?: [eE] [-+]? \d+ )? ) (?{ my $v = eval $^N; say "->have number $v" if TRACE_JSON; [$^R, $v] }) ) ) #DEFINE }xms; my $struct; { local $_ = shift; local $^R; eval { m{\A$rx\z}; } and $struct = $_; } return $struct; }
The best programs are the ones written when the programmer is supposed to be working on something else. - Melinda Varian

In reply to strange behavior of JSON parsing guru regex by seki

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.