What perl version?
As stated, my current working in on 5.28.0 (mac), but I was previously with 5.22.0 (mac/linux), while my needs is to work at least starting with 5.16.3. I am using plenv to switch easily.

Mind your warnings?
I never seen such a warning previously, and not currently. Seems to be related to variables in closures? I will try to understand that point, but my experience shows the problem is not related to this.

Use more subroutines?
Yes, that seems to have an impact on the result: I understand intuitively that when reducing the amount of code inlined in the regex with (?{ ... }), it moves forward the bound of the segfault. Tested by removing all tracing code, moving the most of actions into subs: the parsed that segfaulted with the 10 000 values can now parse it, but it fails now with a 16500 values. (The input file is always a single-line json without whitespaces with an array of bjects. Each object has 60 attributes of string type or null). I am thinking on a way to anonymize values in order to publish test data.
Note that I tried the original parser by Randall Schwartz : it segfaults also on mac with the 16 500 values file.
I wonder if there could have some "hidden trick" that could help to test by increasing the amount of memory allowed to the regex engine, but I don't know the guts parts.

Use more "Possessive quantifiers"?
Why not, while hacking in Perl for some time now, I never used them. I seem to remind it allows to reduce backtracking, right?

Here is the clean version of the parser, closest to the Randall Schwartz version, only using helper subs instead of the uncomprehensible $^R stack hack:

use strict; use warnings; use feature 'say'; use utf8; use open ':std', ':encoding(UTF-8)'; use Data::Dumper; $Data::Dumper::Sortkeys = 1; $Data::Dumper::Indent = 0; # compact dump my $data = $ARGV[0] || '{"id":42}'; if (-f $data){ open my $f, '<', $data or die "cannot open $data: $@"; $data = do { local $/; <$f> }; close $f; } my $o = from_json($data); $Data::Dumper::Indent = 2; # fancy dump say Dumper $o; sub TRACE_JSON {0} sub eval_json_string { my $s = shift; $s =~ s/\\u([0-9A-Fa-f]{4})/\\x{$1}/g; $s =~ s/@/\\@/g; return eval $s; } my @eval_stack; sub dump_stack { say "stack is ",scalar(@eval_stack),' =>' , Dumper(\ +@eval_stack) } sub push_val { push @eval_stack, shift; } sub peek_val { my $idx = shift || -1; return $eval_stack[ $idx ]; } sub pop_val { return pop @eval_stack; } sub add_obj_val { my ($k,$v) = @_; $eval_stack[-1]->{$k} = $v; } sub add_arr_val { my $v = shift; push @{$eval_stack[-1]}, $v; } # Return a Perl structure corresponding to a json string sub from_json { my $i = '&#8901;'; # indent char my $l = 0; # indent level say "Initial stack is ", Dumper(\@eval_stack) if TRACE_JSON; my $rx = qr{ # NOTES: # this regex is a recusrive descent parser - see https://www.perlm +onks.org/?node_id=995856 # and chapter 1 "Recursive regular expressions" of Mastering Perl +(Brian d Foy) # # Inside the block (?(DEFINE) ...) (?<FOOBAR> ...) defines a name +d pattern FOOBAR # that can be called with (?&FOO +BAR) # (?{ ... }) is a block of Perl code that is evaluated at the time + we reach it while running the pattern # $^R is the value returned by the LAST runned (?{ }) block, so it + is overriden at each (?{ ... }) # if you want to run random code, remember to add $^R as last +statement to always keep the value # $^N is the last matched (non-anonymous) group (?&VALUE) (?{ $_ = pop_val() }) # <== entry point of the parser (?(DEFINE) # this does not try to match, it only defines a serie o +f named patterns (?<VALUE> \s*+ ( (?&STRING) | (?&NUMBER) | (?&OBJECT) | (?&ARRAY) | true (?{ push_val(1) }) | false (?{ push_val(0) }) | null (?{ push_val(undef) }) ) \s*+ ) (?<OBJECT> # will generate a Perl hash \{ # start of object (?{ push_val({}); }) # init structure \s*+ (?: (?&KV) # first pair (?{ my $v = pop_val(); my $k = pop_val(); add_obj_val($k, $ +v); }) (?: # additional pairs \s* , \s* (?&KV) (?{ my $v = pop_val(); my $k = pop_val(); add_obj_val($k, + $v); }) )* # additional pairs are optional )? # object may be empty \s*+ \} ) (?<KV> # tuple <key, value> (?&STRING) \s*+ : \s*+ (?&VALUE) (?{ }) ) (?<ARRAY> # will generate a Perl array \[ (?{ push_val([]); }) # init structure \s*+ (?: # first element (?&VALUE) (?{ my $v = pop_val(); add_arr_val( $v ) }) (?: # additional elements \s*+ , \s*+ (?&VALUE) (?{ my $v = pop_val(); add_arr_val( $v ) }) )* # additional elements are optional )? # array may be empty \s*+ \] # end of array ) (?<STRING> ( " (?: [^\\"]+ | \\ ["\\/bfnrt] | \\ u [0-9a-fA-f]{4} )*+ " ) (?{ my $v = eval_json_string($^N); push_val($v); }) ) (?<NUMBER> ( -? (?: 0 | [1-9]\d*+ ) (?: \. \d+ )? (?: [eE] [-+]? \d+ )? ) (?{ my $v = eval $^N; push_val($v); }) ) ) #End of DEFINE }xms; my $struct; { local $_ = shift; local $^R; eval { m{\A$rx\z}; } and $struct = $_; say "eval error: $@" if $@; } return $struct; }

The best programs are the ones written when the programmer is supposed to be working on something else. - Melinda Varian

In reply to Re^4: strange behavior of JSON parsing guru regex by seki
in thread strange behavior of JSON parsing guru regex by seki

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.