Hi Monks

Edit: I have made the code less ugly by indenting the output. The behavior is the same, except that for inputs such as '12,34' I now get warnings like 'Negative repeat count does nothing at sscce_rxjs.pl line 43.', which suggests that some of the 'value' tests are re-entered without passing through the start of the rule. /Edit

I am working on an evolution of that astonishing regexp published here, which parses JSON with a self-contained, recursive-descent extended regexp.

I have integrated it into a tool script that works pretty well, although I have run into catastrophic memory consumption on some big JSON files, because parameters are passed between rules by stacking them in $^R. As the original code stated, my aim is to have minimal, if any, external dependencies beyond the standard distribution packages. Performance is not an issue, but I will try to improve memory usage on big data structures.

I have reworked the code a bit and commented it as much as possible to document how it works (it took me a long time to grasp how that extended regexp works, especially the use of $^R, $^N and code blocks). I have also added a lot of trace strings to follow the execution of the regex.
Well-formed JSON is parsed successfully, but I do not understand the traces for malformed data, and I noticed a worst case that segfaults on OSX (10.13, Perl 5.22.0 and 5.28.0) while it works fine on Linux with the same 5.22.0. plenv is used in both cases.
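
For readers unfamiliar with these constructs, here is a minimal sketch of how (?{ }) blocks can thread a result through $^R while reading captures from $^N. The input string, the pattern and the $result variable are made up for illustration; they are not part of the script:

```perl
use strict;
use warnings;

my $result;
'answer=42' =~ m{
    \A
    ( [a-z]+ ) (?{ [ $^R, $^N ] })   # $^R becomes [undef, 'answer']
    =
    ( \d+ )    (?{ [ $^R, $^N ] })   # $^R becomes [[undef, 'answer'], '42']
    \z
    # unpack the nested stack into a hash (hypothetical final step)
    (?{ $result = { $^R->[0][1] => $^R->[1] } })
}x;

print "$_ => $result->{$_}\n" for keys %$result;
```

Each block returns a new array whose first element is the previous $^R, which is the same stacking scheme the parser below relies on.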

A correct run:

$ perl sscce_rxjs.pl '{"id":42}'
value?
try object
tuple rule
string rule
->have string 'id'
value?
try object
try number
number rule
->have number 42
post number
end of value
->have tuple
first object pair $VAR1 = [
          [
            undef,
            {}
          ],
          'id',
          42
        ];

end of value
$VAR1 = {
          'id' => 42
        };

Now, with the input "12,34", the undef returned is the expected value, but I do not understand where all the lines following the first 'end of value' come from. No 'value?' line shows that the "value" rule was entered again, yet we see an additional 'end of value' and several 'try xx' lines, which are alternatives of the 'value' rule:

$ perl sscce_rxjs.pl '12,34'
value?
try object
try number
number rule
->have number 12
post number
end of value
->have number 1
post number
end of value
try string       <--- EDIT: could it be a behavior of 'value' rule 'or' alternatives being not short-circuited? 
string rule
try array
try true
try false
try null
$VAR1 = undef;
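
To check whether backtracking could be what re-runs the code blocks, here is a reduced sketch applied to the same input. It is not taken from the script: the @trace array and the pattern are only illustrative. It assumes that a (?{ }) block is executed every time the engine passes over it, including on retries after backtracking:

```perl
use strict;
use warnings;

my @trace;
my $matched = '12,34' =~ m{
    \A
    ( [1-9]\d* )             # greedy integer, similar to the NUMBER rule
    (?{ push @trace, $^N })  # side effect: record what was just captured
    \z                       # cannot match before ',', so the engine backtracks
}x;

print "matched: ", ($matched ? 'yes' : 'no'), "\n";
print "captured in turn: @trace\n";
```

If the assumption holds, the block records '12' first and then '1' once \d* gives back a digit, which would resemble the second '->have number 1' line in the trace above.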

Now, with an incorrect input as simple as '"foo",', I am entering the twilight zone: the output is redundant, and I noticed that the behavior is somewhat random. Note that in the production code, the execution segfaults on my Mac with slightly more complex data, like the one available from this test data. I wonder if it could be a side effect of the eval() used to compute the value returned by the 'string' rule?

$ perl sscce_rxjs.pl '"foo",'
value?
try object
try number
number rule
try string
string rule
->have string 'foo'
end of value
->have string 'foo'
end of value
->have string 'foo'
end of value
->have string 'foo'
end of value
try array
try true
try false
try null
$VAR1 = undef;

Would you have an explanation, or did I miss some error that is just not obvious to me?

use strict;
use warnings;
use feature 'say';
use Data::Dumper;
$Data::Dumper::Sortkeys = 1;
$Data::Dumper::Indent   = 2;

my $data = $ARGV[0] || '{"id":42}';

my $o = from_json($data);
say Dumper $o;

sub TRACE_JSON {1}

# Return a Perl structure corresponding to a json string
sub from_json {
    my $rx = qr{
    # NOTES:
    # this regex is a recursive descent parser - see https://www.perlmonks.org/?node_id=995856
    # and chapter 1 "Recursive regular expressions" of Mastering Perl (Brian d Foy)
    #
    # Inside the block (?(DEFINE) ...)  (?<FOOBAR> ...) defines a named pattern FOOBAR
    #                                   that can be called with (?&FOOBAR)
    # (?{ ... }) is a block of Perl code that is evaluated at the time we reach it while running the pattern
    # $^R is the value returned by the last executed (?{ }) block
    # $^N is the text matched by the most recently closed group

    (?&VALUE) (?{ $_ = $^R->[1] }) # <== entry point of the parser
    
    (?(DEFINE) # this does not try to match, it only defines a series of named patterns
    
      (?<VALUE> (?{ say 'value?' if TRACE_JSON;$^R })
        \s*
        (
        (?{ say 'try object' if TRACE_JSON;$^R }) (?&OBJECT)
        |
        (?{ say 'try number' if TRACE_JSON;$^R }) (?&NUMBER) (?{ say 'post number' if TRACE_JSON;$^R })
        |
        (?{ say 'try string' if TRACE_JSON;$^R }) (?&STRING)
        |
        (?{ say 'try array'  if TRACE_JSON;$^R }) (?&ARRAY)
        |
        (?{ say 'try true'  if TRACE_JSON;$^R }) true  (?{ say '->true' if TRACE_JSON; [$^R, 1] })
        |
        (?{ say 'try false'  if TRACE_JSON;$^R }) false (?{ say '->false' if TRACE_JSON; [$^R, 0] })
        |
        (?{ say 'try null'  if TRACE_JSON;$^R }) null  (?{ say '->null' if TRACE_JSON; [$^R, undef] })
        ) 
        \s* (?{ say 'end of value' if TRACE_JSON;$^R })
      )
    
      (?<OBJECT> # will generate a Perl hash
        (?{ [$^R, {}] })  # init structure
        \{ # start of object
          \s*
          (?: 
            (?&KV) # [[$^R, {}], $k, $v]   # first pair 
            (?{ say('first object pair ', Dumper($^R)) if TRACE_JSON; [$^R->[0][0], {$^R->[1] => $^R->[2]}] })
      
            (?: # additional pairs 
            \s* , \s* (?&KV) # [[$^R, {...}], $k, $v]
              (?{ say('additional object pair ', Dumper($^R)) if TRACE_JSON; [$^R->[0][0], {%{ $^R->[0][1]}, $^R->[1] => $^R->[2]}] })
            )* # additional pairs are optional
          )? # object may be empty
        \}  # end of object
      )
    
      (?<KV>  # tuple <key, value>
        (?{ say 'tuple rule' if TRACE_JSON;$^R })
        (?&STRING) # [$^R, "string"]
        \s* : \s* (?&VALUE) # [[$^R, "string"], $value]
      
        (?{ say '->have tuple' if TRACE_JSON; [$^R->[0][0], $^R->[0][1], $^R->[1]] })
      )
    
      (?<ARRAY> # will generate a Perl array
        (?{ [$^R, []] })  # init structure
        \[ # start of array
          (?: 
            (?&VALUE)   # first element 
            (?{ say('first array item ', Dumper($^R)) if TRACE_JSON; [$^R->[0][0], [$^R->[1]]] })
      
            (?: # additional elements
            \s* , \s* (?&VALUE) # additional elements
              (?{ say('additional array item ', Dumper($^R)) if TRACE_JSON; [$^R->[0][0], [@{$^R->[0][1]}, $^R->[1]]] })
            )* # additional elements are optional
          )? # array may be empty
        \] # end of array
      )
    
      (?<STRING> (?{ say 'string rule' if TRACE_JSON;$^R })
        (
          "
          (?:
            [^\\"]+
          |
            \\ ["\\bfnrt]  # escaped backspace, form feed, newline, carriage return, tab, \, "
          |
            \\ u [0-9a-fA-F]{4} 
          )*
          "
        )
        (?{ 
            my $s = $^N; 
            $s =~ s/\\u([0-9A-Fa-f]{4})/\\x{$1}/g;
            $s =~ s/@/\\@/g;
            my $v = eval $s;
            say "->have string '$v'" if TRACE_JSON;
            [ $^R, $v ] })
      )
    
      (?<NUMBER> (?{ say 'number rule' if TRACE_JSON;$^R })
        (
          -?
          (?: 0 | [1-9]\d* )
          (?: \. \d+ )?
          (?: [eE] [-+]? \d+ )?
        )
        (?{ my $v = eval $^N;
            say "->have number $v" if TRACE_JSON; 
            [$^R, $v] })
      )
    
    ) #DEFINE
    }xms;
    my $struct;
    {
        local $_ = shift;
        local $^R;
        eval { m{\A$rx\z}; } and $struct = $_;
    }
    return $struct;
}

The best programs are the ones written when the programmer is supposed to be working on something else. - Melinda Varian

In reply to strange behavior of JSON parsing guru regex by seki
