strange behavior of JSON parsing guru regex

seki has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks

Edit:I have improved the ugliness of the code by indenting the output. Behavior is same, except for cases like say '12,34' where i got warnings like 'Negative repeat count does nothing at sscce_rxjs.pl line 43.' showing we seem reenter some of 'value' tests, but without passing via start of the rule.../EDIT

I am working on an evolution of the that astonishing regexp published here to parse JSON in a self-contained recursive descendent extended regexp.

I have integrated it in tool script that is working pretty well, while I only encountered some catastrophic memory consumption issues on some big JSON files due to the parameter passing between rules by stacking in $^R. And like the original code stated, my need is to strive to have the minimum if no external dependency but standard distribution packages. Performance is not an issue, while I will try to improve memory usage on big data structures.

I have reworked the code a bit, commented as possible to document working (it took me a long time to grasp the working of that extended regexp, especially usage of $^R, $^N and code rules). I have also added a lot of string to trace the execution of the regex.
Well formed JSON is parsed successfully, while I do not understand the traces of non well-formed data, and I noticed a worst case that segfaults on OSX (10.13 / 5.22.0 & 5.28.0) while it is working well on Linux with same 5.22.0. plenv is used in both cases.

A correct run:

$ perl sscce_rxjs.pl '{"id":42}'
value?
try object
tuple rule
string rule
->have string 'id'
value?
try object
try number
number rule
->have number 42
post number
end of value
->have tuple
first object pair $VAR1 = [
          
            undef,
            {}
          ,
          'id',
          42
        ];

end of value
$VAR1 = {
          'id' => 42
        };

Now, with an input data of "12,34", while the undef returned is the expected value, I do not understand the origin of all lines following the first 'end of value', while we do not see that a new "value" rule is entered by telling 'value?' we see an additional 'end of value' ans several 'try xx' that are alternatives in the 'value' rule:

$ perl sscce_rxjs.pl '12,34'
value?
try object
try number
number rule
->have number 12
post number
end of value
->have number 1
post number
end of value
try string       <--- EDIT: could it be a behavior of 'value' rule 'or' alternatives being not short-circuited? 
string rule
try array
try true
try false
try null
$VAR1 = undef;

Now with an incorrect input as simple as '"foo",' i am entering the twilight zone with redundant output, and I noticed the behavior is somewhat random. Note that in the production code, the execution segfaults on my mac with a little more complex data like the one available from this test data. I wonder if it could be a side-effect of the eval() to compute the value returned by 'string' rule?

$ perl sscce_rxjs.pl '"foo",'
value?
try object
try number
number rule
try string
string rule
->have string 'foo'
end of value
->have string 'foo'
end of value
->have string 'foo'
end of value
->have string 'foo'
end of value
try array
try true
try false
try null
$VAR1 = undef;

Would you have an explanation, or did I missed some error just not obvious for me?

use strict;
use warnings;
use feature 'say';
use Data::Dumper;
$Data::Dumper::Sortkeys = 1;
$Data::Dumper::Indent   = 2;

my $data = $ARGV[0] || '{"id":42}';

my $o = from_json($data);
say Dumper $o;

sub TRACE_JSON {1}

# Return a Perl structure corresponding to a json string
sub from_json {
    my $rx = qr{
    # NOTES:
    # this regex is a recusrive descent parser - see https://www.perlm
+onks.org/?node_id=995856
    # and chapter 1 "Recursive regular expressions" of Mastering Perl 
+(Brian d Foy)
    #
    # Inside the block (?(DEFINE) ...)  (?<FOOBAR> ...) defines a name
+d pattern FOOBAR
    #                                   that can be called with (?&FOO
+BAR)
    # (?{ ... }) is a block of Perl code that is evaluated at the time
+ we reach it while running the pattern
    # $^R is the value returned by the last runned (?{ }) block
    # $^N is the last matched group

    (?&VALUE) (?{ $_ = $^R->[1] }) # <== entry point of the parser
    
    (?(DEFINE) # this does not try to match, it only defines a serie o
+f named patterns
    
      (?<VALUE> (?{ say 'value?' if TRACE_JSON;$^R })
        \s*
        (
        (?{ say 'try object' if TRACE_JSON;$^R }) (?&OBJECT)
        |
        (?{ say 'try number' if TRACE_JSON;$^R }) (?&NUMBER) (?{ say '
+post number' if TRACE_JSON;$^R })
        |
        (?{ say 'try string' if TRACE_JSON;$^R }) (?&STRING)
        |
        (?{ say 'try array'  if TRACE_JSON;$^R }) (?&ARRAY)
        |
        (?{ say 'try true'  if TRACE_JSON;$^R }) true  (?{ say '->true
+' if TRACE_JSON; [$^R, 1] })
        |
        (?{ say 'try false'  if TRACE_JSON;$^R }) false (?{ say '->fal
+se' if TRACE_JSON; [$^R, 0] })
        |
        (?{ say 'try null'  if TRACE_JSON;$^R }) null  (?{ say '->null
+' if TRACE_JSON; [$^R, undef] })
        ) 
        \s* (?{ say 'end of value' if TRACE_JSON;$^R })
      )
    
      (?<OBJECT> # will generate a Perl hash
        (?{ [$^R, {}] })  # init structure
        \{ # start of object
          \s*
          (?: 
            (?&KV) # [[$^R, {}], $k, $v]   # first pair 
            (?{ say('first object pair ', Dumper($^R)) if TRACE_JSON; 
+[$^R->[0][0], {$^R->[1] => $^R->[2]}] })
      
            (?: # additional pairs 
            \s* , \s* (?&KV) # [[$^R, {...}], $k, $v]
              (?{ say('additional object pair ', Dumper($^R)) if TRACE
+_JSON; [$^R->[0][0], {%{ $^R->[0][1]}, $^R->[1] => $^R->[2]}] })
            )* # additional pairs are optional
          )? # object may be empty
        \}  # end of object
      )
    
      (?<KV>  # tuple <key, value>
        (?{ say 'tuple rule' if TRACE_JSON;$^R })
        (?&STRING) # [$^R, "string"]
        \s* : \s* (?&VALUE) # [[$^R, "string"], $value]
      
        (?{ say '->have tuple' if TRACE_JSON; [$^R->[0][0], $^R->[0][1
+], $^R->[1]] })
      )
    
      (?<ARRAY> # will generate a Perl array
        (?{ [$^R, []] })  # init structure
        \[ # start of array
          (?: 
            (?&VALUE)   # first element 
            (?{ say('first array item ', Dumper($^R)) if TRACE_JSON; [
+$^R->[0][0], [$^R->[1]]] })
      
            (?: # additional elements
            \s* , \s* (?&VALUE) # additional elements
              (?{ say('additional array item ', Dumper($^R)) if TRACE_
+JSON; [$^R->[0][0], [@{$^R->[0][1]}, $^R->[1]]] })
            )* # additional elements are optional
          )? # array may be empty
        \] # end of array
      )
    
      (?<STRING> (?{ say 'string rule' if TRACE_JSON;$^R })
        (
          "
          (?:
            [^\\"]+
          |
            \\ ["\\bfnrt]  # escaped backspace, form feed, newline, ca
+rriage return, tab, \, "
          |
            \\ u [0-9a-fA-F]{4} 
          )*
          "
        )
        (?{ 
            my $s = $^N; 
            $s =~ s/\\u([0-9A-Fa-f]{4})/\\x{$1}/g;
            $s =~ s/@/\\@/g;
            my $v = eval $s;
            say "->have string '$v'" if TRACE_JSON;
            [ $^R, $v ] })
      )
    
      (?<NUMBER> (?{ say 'number rule' if TRACE_JSON;$^R })
        (
          -?
          (?: 0 | [1-9]\d* )
          (?: \. \d+ )?
          (?: [eE] [-+]? \d+ )?
        )
        (?{ my $v = eval $^N;
            say "->have number $v" if TRACE_JSON; 
            [$^R, $v] })
      )
    
    ) #DEFINE
    }xms;
    my $struct;
    {
        local $_ = shift;
        local $^R;
        eval { m{\A$rx\z}; } and $struct = $_;
    }
    return $struct;
}
[download]

The best programs are the ones written when the programmer is supposed to be working on something else. - Melinda Varian

Comment on strange behavior of JSON parsing guru regex Download Code

Replies are listed 'Best First'.
Re: strange behavior of JSON parsing guru regex by 1nickt (Canon) on Apr 16, 2020 at 02:01 UTC
Hi, "...my need is to strive to have the minimum if no external dependency but standard distribution packages..." "...OSX (10.13 / 5.22.0 & 5.28.0) while it is working well on Linux with same 5.22.0..." FWIW: `$ corelist JSON::PP Data for 2019-11-10 JSON::PP was first released with perl v5.13.9` [download] Hope this helps! The way forward always starts with a minimal test.	[reply] [d/l]
Re^2: strange behavior of JSON parsing guru regex by seki (Monk) on Apr 16, 2020 at 10:10 UTC
Thanks for your suggestion, I used 5.12.0 for a long time and was not aware that JSON::PP was incorporated in 5.13. While I mention 5.22 in my tests, one of the targeted platforms, where I have no possibility to use non-core packages has a system Perl of 5.16.3. So it could work. (Of course I could cheat with carton, but as the script could be run if needed on any of thousands of hosts without being installed at first it becomes teddious to deploy some local dependencies prior to run the script.) BUT: my initial goal is to understand the cause of the unexpected behavior. ;o) The best programs are the ones written when the programmer is supposed to be working on something else. - Melinda Varian	[reply]
Re: strange behavior of JSON parsing guru regex by Anonymous Monk on Apr 16, 2020 at 09:56 UTC
For tracing it helps to indent depth Also you should used ratcheting/possessive quantifiers in VALUE	[reply]
Re^2: strange behavior of JSON parsing guru regex by seki (Monk) on Apr 16, 2020 at 12:26 UTC
I tried to handle indentation (see the linked gist in my edit), regex is more ugly. :) I don't follow you about quantifiers: apart the starting an ending optional white spaces, all alternatives of a 'value' is one and only one occurrence of an object, number, string, array, true, false, null. What do you suggest to quantify? The best programs are the ones written when the programmer is supposed to be working on something else. - Melinda Varian	[reply]
Re^3: strange behavior of JSON parsing guru regex by Anonymous Monk on Apr 17, 2020 at 02:11 UTC
I don't follow ... In that case I suggest sticking with a module	[reply]
Re^4: strange behavior of JSON parsing guru regex by seki (Monk) on Apr 15, 2022 at 12:39 UTC
Re: strange behavior of JSON parsing guru regex by Anonymous Monk on Apr 17, 2020 at 03:06 UTC
A TIP `(?{ #~ my $s = $^N; #~ $s =~ s/\\u([0-9A-Fa-f]{4})/\\x{$1}/g; #~ $s =~ s/@/\\@/g; #~ my $v = eval $s; my $v = BONKERS($^N); say "->have string '$v'" if TRACE_JSON; [ $^R, $v ] }) )` [download]	[reply] [d/l]
Re^2: strange behavior of JSON parsing guru regex by seki (Monk) on Apr 19, 2020 at 10:22 UTC
Yes, i wanted to improve somehow the regex-only code by using some helpers to simplify at least the parsing stack management because the $^R concept while powerful is quite tricky and very tedious because of the obligation to finish all (?{}) statements with a final $^R due to the reset of $^R by every (?{ }). I have also put the string evaluation in an external helper. I have the feeling it improved greatly the memory consumption (on my linux it took previously ~10 GB of memory, to parse a 10 MB json file and I experienced some swapping hell with bigger files) but i still experience some crash on malformed json, if you put just a coma at the end of a correct value. Here is the improved test version: use strict; use warnings; use feature 'say'; use utf8; use open ':std', ':encoding(UTF-8)'; use Data::Dumper; $Data::Dumper::Sortkeys = 1; $Data::Dumper::Indent = 0; # compact dump my $data = $ARGV[0] \|\| '{"id":42}'; if (-f $data){ open my $f, '<', $data or die "cannot open $data: $@"; $data = do { local $/; <$f> }; close $f; } my $o = from_json($data); $Data::Dumper::Indent = 2; # fancy dump say Dumper $o; sub TRACE_JSON {1} sub eval_json_string { my $s = shift; $s =~ s/\\u([0-9A-Fa-f]{4})/\\x{$1}/g; $s =~ s/@/\\@/g; return eval $s; } my @eval_stack; sub dump_stack { say "stack is ",scalar(@eval_stack),' =>' , Dumper(\ +@eval_stack) } sub push_val { push @eval_stack, shift; } sub peek_val { my $idx = shift \|\| -1; return $eval_stack[ $idx ]; } sub pop_val { return pop @eval_stack; } sub add_obj_val { my ($k,$v) = @_; $eval_stack[-1]->{$k} = $v; } sub add_arr_val { my $v = shift; push @{$eval_stack[-1]}, $v; } # Return a Perl structure corresponding to a json string sub from_json { my $i = '⋅'; # indent char my $l = 0; # indent level say "Initial stack is ", Dumper(\@eval_stack) if TRACE_JSON; my $rx = qr{ # NOTES: # this regex is a recusrive descent parser - see https://www.perlm +onks.org/?node_id=995856 # and chapter 1 "Recursive regular expressions" of Mastering Perl +(Brian d Foy) # # Inside the block (?(DEFINE) ...) (?<FOOBAR> ...) defines a name +d pattern FOOBAR # that can be called with (?&FOO +BAR) # (?{ ... }) is a block of Perl code that is evaluated at the time + we reach it while running the pattern # $^R is the value returned by the LAST runned (?{ }) block, so it + is overriden at each (?{ ... }) # if you want to run random code, remember to add $^R as last +statement to always keep the value # $^N is the last matched group (?&VALUE) (?{ $_ = pop_val() }) # <== entry point of the parser (?(DEFINE) # this does not try to match, it only defines a serie o +f named patterns (?<VALUE> (?{ say $i x $l++,'Value?' if TRACE_JSON }) \s* ( (?{ say $i x $l,'try object' if TRACE_JSON; $l++; }) (?&OBJECT +) \| (?{ say $i x $l,'try number' if TRACE_JSON; }) (?&NUMBER) (?{ +say $i x $l,'post number' if TRACE_JSON }) \| (?{ say $i x $l,'try string' if TRACE_JSON; }) (?&STRING) \| (?{ say $i x $l,'try array' if TRACE_JSON; $l++; }) (?&ARRAY) + (?{ $l-- }) \| (?{ say $i x $l,'try true' if TRACE_JSON; }) true (?{ say $i + x $l,'->true' if TRACE_JSON; push_val(1) }) \| (?{ say $i x $l,'try false' if TRACE_JSON; }) false (?{ say $ +i x $l,'->false' if TRACE_JSON; push_val(0) }) \| (?{ say $i x $l,'try null' if TRACE_JSON; }) null (?{ say $i + x $l,'->null' if TRACE_JSON; push_val(undef) }) ) \s* (?{ $l--; say ($i x $l,'->have value: ', Dumper(peek_val)) + if TRACE_JSON; }) ) (?<OBJECT> # will generate a Perl hash \{ # start of object (?{ push_val({}); }) # init structure \s* (?: (?&KV) # first pair (?{ say($i x $l,'first object pair ', Dumper([ peek_val(-2 +),peek_val(-1)])) if TRACE_JSON; my $v = pop_val(); my $k = pop_val() +; add_obj_val($k, $v); }) (?: # additional pairs \s* , \s* (?&KV) (?{ say($i x $l,'additional object pair ', Dumper([ peek +_val(-2),peek_val(-1) ])) if TRACE_JSON; my $v = pop_val(); my $k = p +op_val(); add_obj_val($k, $v), }) )* # additional pairs are optional )? # object may be empty \} # end of object ) (?<KV> # tuple <key, value> (?{ say $i x $l,'tuple rule' if TRACE_JSON; $l++; }) (?&STRING) \s* : \s* (?&VALUE) (?{ $l--; say($i x $l,'->have tuple ', Dumper([peek_val(-2),pe +ek_val(-1)]) ) if TRACE_JSON; }) ) (?<ARRAY> # will generate a Perl array \[ # start of array (?{ push_val([]); }) # init structure (?: (?&VALUE) # first element (?{ say($i x $l,'first array item ', peek_val(-1)) if TRAC +E_JSON; my $v = pop_val(); add_arr_val( $v ) }) (?: # additional elements \s* , \s* (?&VALUE) # additional elements (?{ say($i x $l,'additional array item ', peek_val(-1)) +if TRACE_JSON; add_arr_val( pop_val() ) }) )* # additional elements are optional )? # array may be empty \] # end of array (?{ say $i x $l,'->array: ',Dumper(\@eval_st +ack) }) ) (?<STRING> (?{ say $i x $l,'string rule' if TRACE_JSON;$^R }) ( " (?: [^\\"]+ \| \\ ["\\bfnrt] # escaped backspace, form feed, newline, ca +rriage return, tab, \, " \| \\ u [0-9a-fA-F]{4} )* " ) (?{ my $v = eval_json_string($^N); say $i x $l,"->have string '$v'" if TRACE_JSON; push_val($v) }) ) (?<NUMBER> (?{ say $i x $l,'number rule' if TRACE_JSON;$^R }) ( -? (?: 0 \| [1-9]\d* ) (?: \. \d+ )? (?: [eE] [-+]? \d+ )? ) (?{ my $v = eval $^N; say $i x $l,"->have number $v" if TRACE_JSON; push_val($v); }) ) ) #DEFINE }xms; my $struct; { local $_ = shift; local $^R; eval { m{\A$rx\z}; } and $struct = $_; say "eval error: $@" if $@; } return $struct; } [download] The best programs are the ones written when the programmer is supposed to be working on something else. - Melinda Varian	[reply] [d/l]
Re^3: strange behavior of JSON parsing guru regex by Anonymous Monk on Apr 20, 2020 at 03:13 UTC
I have the feeling it improved greatly the memory consumption (on my linux it took previously ~10 GB of memory, to parse a 10 MB json file and I experienced some swapping hell with bigger files) but i still experience some crash on malformed json, if you put just a coma at the end of a correct value. What perl version? Mind your warnings? will not stay shared ... local our $i Use more subroutines? ex `(?{ my $v = eval $^N; say $i x $l,"->have number $v" if TRACE_JSON; push_val($v); }) )` [download] `(?{ push_number( $^N, $depth ); }) sub push_number { ... TRACE( $depth, $msg );` [download] ... </c> Use more "Possessive quantifiers"? For example `(?>(?&STRING)) # ratchet \s+ # ratchet [^\\"]++ # ratchet )+ # ratchet (?: 0 \| [1-9]\d*+ # ratchet` [download]	[reply] [d/l] [select]
Re^4: strange behavior of JSON parsing guru regex by seki (Monk) on Apr 22, 2020 at 09:58 UTC