in reply to Recursive regular expression weirdness

diotalevi provided the explanation of the problem. The following is a tested solution:
... our $rxNest; $rxNest = qr{(?x) ( \( [^()]* (?: (??{$rxNest}) [^()]* )* \) ) (?{ [ @{$^R}, $1 ] }) }; our $rxOnlyNested = qr{(?x) (?{ [] }) ^ [^()]* (?: $rxNest [^()]* )* \z (?{ @memoList = @{$^R} }) }; ... if ($string =~ $rxOnlyNested) ...

Note that @memoList is now correct whether there is backtracking or not, thanks to the use of $^R.

Note that I used $1 instead of $+. Using $+ will negatively impact on the performance of all regexps without captures executed in that perl interpreter instance, whereever they are located.

Update: Removed unnecessary parens. (Copy and paste bug.)

Replies are listed 'Best First'.
Re^2: Recursive regular expression weirdness
by johngg (Canon) on Mar 30, 2006 at 11:44 UTC
    Thank you for providing a solution, which I think I understand. Please correct me if I am wrong.

    The (?{ [] }) at the start of $rxOnlyNested creates an empty list reference and the reference will be held in $^R (the result of the last code block). In $rxNest the (?{ [ @{$^R}, $1 ] }) at the end creates a new list reference which contains the dereferenced contents of the last list plus the memory group of the successful match. This list keeps growing as the regular expression recursed, in effect. Finally, in $rxOnlyNested the list is dereferenced and assigned to @memoList.

    I hope I have understood correctly and if it is doing what I think it is doing, it is a very neat solution. I am glad I have finally started visiting the Monastery as I am learning a lot of new techniques.

    Cheers,

    JohnGG

      That's exactly it.

      Now, you might wonder why I didn't add to @memoList directly. It might be best to use an example:

      local $_ = 'abbbbbc'; { local @memoList; / a (?: (b) (?{ push @memoList, $1 }) )* bbc /x; print(@memoList, "\n"); # bbbbb # Wrong! There should only be three! } { local @memoList; / (?{ [] }) a (?: (b) (?{ [ @{$^R}, $1 ] }) )* bbc (?{ @memoList = @{$^R} }) /x; print(@memoList, "\n"); # bbb } { local @temp; local @memoList; / # (?{ @temp = (); }) # Redundant with 'local @temp'. a (?: (b) (?{ local @temp = (@temp, $1); }) )* bbc (?{ @memoList = @temp }) /x; print(@memoList, "\n"); # bbb }

      When the regexp engine finds that it read too many 'b's — rememeber that * is greedy — it backtracks, "unreading" the last 'b' and eventually then a second last 'b'. $^R and @temp (since we keep using local to assign to @temp) are unwound when backtracking occurs, so the last assignment is undone with each "unreading". @memoList is not unwound, on the other hand, so it keeps holding the extra 'b's.

      A quick tests shows that using $^R is *much* faster than using local.

      Update: Removed unnecessary parens. (Copy and paste bug.)

        Thank you again for elaborating on your earlier reply. I can see from your examples how localisation is necessary to avoid, for example, @memoList getting too many 'b's. There is one thing that is confusing me though. In the second and third patterns I don't understand the significance of the extra (memory group?) brackets around the code block (?{...}), as in ((?{ @memoList = @{$^R} })) and ((?{ @memoList = @temp })). Why do we need these when we are assigning within the pattern rather than having to remember outside of it? I must be missing something.

        Cheers,

        JohnGG