This is code I've trimmed down from Data::CSel to demonstrate the problem I'm having:

package CSelTest;

use 5.020000;
use strict;
use warnings;

our $RE =
    qr{
          (?&ATTR_SELECTOR) (?{ $_ = $^R->[1] })

          (?(DEFINE)
              (?<ATTR_SELECTOR>
                  \[\s*
                  (?{ [$^R, []] })
                  (?&ATTR_SUBJECTS)
                  (?{
                      $^R->[0][1][0] = $^R->[1];
                      $^R->[0];
                  })

                  (?:
                      (
                          \s*=\s*|
                          #\s*!=\s*|
                          # and so on
                          \s+eq\s+
                          #\s+ne\s+
                          # and so on
                      )
                      (?{
                          my $op = $^N;
                          $op =~ s/^\s+//; $op =~ s/\s+$//;
                          $^R->[1][1] = $op;
                          $^R;
                      })

                      (?:
                          (?&LITERAL_NUMBER)
                          (?{
                              $^R->[0][1][2] = $^R->[1];
                              $^R->[0];
                          })
                      )
                  )?
                  \s*\]
              )

              (?<ATTR_NAME>
                  [A-Za-z_][A-Za-z0-9_]*
              )

              (?<ATTR_SUBJECT>
                  (?{ [$^R, []] })
                  ((?&ATTR_NAME))
                  (?{
                      push @{ $^R->[1] }, $^N;
                      $^R;
                  })
                  (?:
                      # attribute arguments
                      \s*\(\s*
                      (?{
                          $^R->[1][1] = [];
                          $^R;
                      })
                      (?:
                          (?&LITERAL_NUMBER)
                          (?{
                              push @{ $^R->[0][1][1] }, $^R->[1];
                              $^R->[0];
                          })
                          (?:
                              \s*,\s*
                              (?&LITERAL_NUMBER)
                              (?{
                                  push @{ $^R->[0][1][1] }, $^R->[1];
                                  $^R->[0];
                              })
                          )*
                      )?
                      \s*\)\s*
                  )?
              )

              (?<ATTR_SUBJECTS>
                  (?{ [$^R, []] })
                  (?&ATTR_SUBJECT)
                  (?{
                      push @{ $^R->[0][1] }, {
                          name => $^R->[1][0],
                          (args => $^R->[1][1]) x !!defined($^R->[1][1]),
                      };
                      $^R->[0];
                  })
              )

              (?<LITERAL_NUMBER>
                  (
                      -?
                      (?: 0 | [1-9]\d* )
                      (?: \. \d+ )?
                      (?: [eE] [-+]? \d+ )?
                  )
                  (?{ [$^R, 0+$^N] })
              )
          ) # DEFINE
  }x;

sub parse_csel {
    state $re = qr{\A\s*$RE\s*\z};

    local $_ = shift;
    local $^R;
    eval { $_ =~ $re } and return $_;
    die $@ if $@;
    return undef;
}

1;

This code tries to parse expressions like [attr], [attr=1], or [attr eq 1], which are similar to CSS attribute selectors.

% perl -I. -Ilib -MCSelTest -MData::Dump -E'dd( CSelTest::parse_csel(q{[attr]}) )'
[[{ name => "attr" }]]

% perl -I. -Ilib -MCSelTest -MData::Dump -E'dd( CSelTest::parse_csel(q{[attr=1]}) )'
[[{ name => "attr" }], "=", 1]

% perl -I. -Ilib -MCSelTest -MData::Dump -E'dd( CSelTest::parse_csel(q{[attr eq 1]}) )'
[[{ name => "attr" }], "eq", 1]

No problem so far. Now, this code also recognizes the forms [meth()], [meth(1,2,3)], and [meth(1,2,3) = 1], that is, an argument list after the attribute/method name. This is where the problem happens:

% perl -I. -Ilib -MCSelTest -MData::Dump -E'dd( CSelTest::parse_csel(q{[attr()]}) )'
[[{ args => [], name => "attr" }]]

% perl -I. -Ilib -MCSelTest -MData::Dump -E'dd( CSelTest::parse_csel(q{[attr()=1]}) )'
[[{ args => [], name => "attr" }], "=", 1]

% perl -I. -Ilib -MCSelTest -MData::Dump -E'dd( CSelTest::parse_csel(q{[attr() eq 1]}) )'
do {
  my $a = [
    [
      { args => [], name => "attr" },    # .[0]
      { args => 'fix', name => "attr" }, # .[1]
    ],    # [0]
    "eq", # [1]
    1,    # [2]
  ];
  $a[0][1]{args} = $a[0][0]{args};
  $a;
}

As you can see, when I use the eq operator (matched by the \s+eq\s+ part of the regex; note the \s+ rather than \s*) instead of the = operator (matched by the \s*=\s* part; note the \s* rather than \s+), I get a duplicated entry in the result (marked by the # .[1] comment).

I'm using perl 5.22.1, but I have also tried 5.24.0 and 5.25.4, with the same results.

Any hints?

UPDATE 2016-09-10: I worked around this problem by setting and incrementing a counter variable in specific places to detect backtracking, and using a conditional to keep my code from being executed multiple times when backtracking occurs. Thanks to everyone who provided responses.
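
A minimal sketch of the kind of guard described in the update, not the actual Data::CSel code. It assumes the duplication comes from a greedy \s* first swallowing the space before "eq", then giving it back, so that the code block after it runs a second time and repeats its push. The names @seen, $entered, $pushed, $naive, $guarded and collect() are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;

# Package variables (hypothetical names) shared with the regex code blocks.
our (@seen, $entered, $pushed);

# Naive version: the push is a side effect on @seen, so it is repeated
# whenever the engine backtracks over the block and moves forward again.
my $naive = qr{
    \A (\w+) \s*
    (?{ push @seen, $^N })
    (?: \s*=\s* | \s+eq\s+ )
    \d+ \z
}x;

# Guarded version: a counter ticks once per attempt at the name, and the
# push is skipped if it already happened for the current tick.
my $guarded = qr{
    \A
    (?{ $entered++ })
    (\w+) \s*
    (?{
        if ($pushed < $entered) {
            push @seen, $^N;
            $pushed = $entered;
        }
    })
    (?: \s*=\s* | \s+eq\s+ )
    \d+ \z
}x;

# Run one regex against one input with fresh state; return what was pushed.
sub collect {
    my ($re, $input) = @_;
    local @seen    = ();
    local $entered = 0;
    local $pushed  = 0;
    $input =~ $re;
    return [@seen];
}

my $naive_count   = @{ collect($naive,   'attr eq 1') };
my $guarded_count = @{ collect($guarded, 'attr eq 1') };
print "naive: $naive_count, guarded: $guarded_count\n";
```

One limitation of this sketch: the guard keeps whatever was pushed first, so if backtracking could change the captured name itself, the stored entry would be stale. An alternative that avoids the guard is to build a new array in each block and return it through $^R instead of mutating a shared one with push.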


In reply to Weirdness (duplicated data) while building result during parsing using regex by perlancar
