Allinav has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I've managed to write a regex which crashes perl and, in diagnosing the issue, found another which returns the wrong data in the capture groups. I wrote the code below to test a regex and print the capture groups it returns.

    #! /usr/bin/perl
    use warnings;
    use strict;

    my $String_to_test = $ARGV[0];
    if ( !defined($String_to_test) || $String_to_test eq "") {
        die "please supply a string to test qr code against as first argument\n";
    }
    my $qrtest = eval ( $ARGV[1] );
    if ( $@ ) {
        die "invalid qr supplied at arg2. caused the following error\n$@\n";
    }
    print "qr=$qrtest\n";
    print "string=$String_to_test\n";
    if ( $String_to_test =~ $qrtest ) {
        print "$qrtest present\n";
        print "1=$1\n" if defined($1);
        print "2=$2\n" if defined($2);
        print "3=$3\n" if defined($3);
        print "4=$4\n" if defined($4);
        print "5=$5\n" if defined($5);
        print "6=$6\n" if defined($6);
        print "7=$7\n" if defined($7);
        print "8=$8\n" if defined($8);
    }
    else {
        print "$qrtest NOT present\n";
    }

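
As a side note, the fixed list of `$1` through `$8` can be replaced with a loop over the offset arrays `@-` and `@+`, so the script does not need to know how many groups a pattern has. The following is only a minimal sketch of that idea; the helper name `captures` is made up for illustration and is not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: match $string against $re and return one
# "N=value" entry for every capture group that took part in the match.
sub captures {
    my ($string, $re) = @_;
    return unless $string =~ $re;

    my @out;
    # $#+ is the number of groups in the pattern that just matched;
    # @- and @+ hold the start and end offsets of each group.
    for my $n (1 .. $#+) {
        next unless defined $-[$n];    # group did not participate
        push @out, "$n=" . substr($string, $-[$n], $+[$n] - $-[$n]);
    }
    return @out;
}

# The Case 1 pattern from perlre, run against the Case 1 string.
print "$_\n"
    for captures('atuvz',
                 qr/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x);
```

On a perl that handles this pattern correctly, this should print the same four groups as Case 1 (1=a, 2=t, 3=v, 4=z).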
To break the problem down I started with the example from perlre for the branch reset. The qr_test.pl contains the code above.
Case 3 is the broken one. Cases 1 and 2 try to show that the fundamental structure of the regex is OK, and Case 4 is a workaround.

Case 1: Using example regex from perlre branch reset
    ./qr_test.pl atuvz 'qr/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x'
    qr=(?x-ism: ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) )
    string=atuvz
    (?x-ism: ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) ) present
    1=a
    2=t
    3=v
    4=z
This does what is expected.

Case 2: Added an additional term to 3rd alternative "(w)"
    ./qr_test.pl atuvwz 'qr/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) (w) ) ( z ) /x'
    qr=(?x-ism: ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) (w) ) ( z ) )
    string=atuvwz
    (?x-ism: ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) (w) ) ( z ) ) present
    1=a
    2=t
    3=v
    4=z
This does not do what is expected: the 4th capture group should contain w, not z, and the 5th should contain z.
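
For comparison, perlre says that each branch of `(?|...)` restarts numbering at the same point, and that groups after the construct are numbered as if it contained only the branch with the most groups. A minimal sketch of what that rule implies for the Case 2 pattern (the variable names are only illustrative):

```perl
use strict;
use warnings;

# Case 2 pattern: the third branch has three groups, (t), (v) and (w),
# so they should be groups 2, 3 and 4, and the trailing ( z ) group 5.
my $re = qr/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) (w) ) ( z ) /x;

# In list context a successful match returns ($1, $2, ...).
my @groups = ( 'atuvwz' =~ $re );

for my $i ( 0 .. $#groups ) {
    my $n = $i + 1;
    print "$n=", ( defined $groups[$i] ? $groups[$i] : 'undef' ), "\n";
}
```

On a perl without the bug, I would expect this to report w in group 4 and z in group 5, which is the numbering the question asks for.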

Case 3: Added a fourth term to the 3rd alternative "(x)"
    ./qr_test.pl atuvwxz 'qr/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) (w) (x) ) ( z ) /x'
    qr=(?x-ism: ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) (w) (x) ) ( z ) )
    string=atuvwxz
    (?x-ism: ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) (w) (x) ) ( z ) ) present
    1=a
    2=t
    ./qr_test.sh: line 11: 40158 Segmentation fault
This crashes perl.

Case 4: Having seen the odd behaviour in Case 2, I tried adding dummy groups to the first alternative.
    ./qr_test.pl 'atuvwxz' 'qr/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) (w) (x) ) ( z ) /x'
    + ./qr_test.pl atuvwxz 'qr/ ( a ) (?| x ( y ) z () () ()| (p (q) r) | (t) u (v) (w) (x) ) ( z ) /x'
    qr=(?x-ism: ( a ) (?| x ( y ) z () () ()| (p (q) r) | (t) u (v) (w) (x) ) ( z ) )
    string=atuvwxz
    (?x-ism: ( a ) (?| x ( y ) z () () ()| (p (q) r) | (t) u (v) (w) (x) ) ( z ) ) present
    1=a
    2=t
    3=v
    4=w
    5=x
    6=z
This works, so I have a workaround.
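
The padding trick can be written out as a small standalone check. This is only a sketch of the workaround described above, using the same padded pattern; it gives the first branch as many groups as the longest branch so that an affected perl does not have to work out the count from a later branch.

```perl
use strict;
use warnings;

# First branch padded with three empty groups so that it declares four
# groups, the same number as the third branch (t) (v) (w) (x).
my $padded = qr/ ( a )
                 (?| x ( y ) z () () ()
                   | (p (q) r)
                   | (t) u (v) (w) (x)
                 )
                 ( z ) /x;

if ( my @groups = ( 'atuvwxz' =~ $padded ) ) {
    # Groups 2..5 come from whichever branch matched; group 6 follows.
    print "groups: @groups\n";
}
```

The empty groups only matter when the first branch matches, where they capture empty strings, so the padding should not change what the other branches return.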

So my questions are these:
Is this a bug?
Or did I miss something in a manual that says the first alternation must have the most capture groups?
If I did miss something in a manual, should perl have warned me rather than crashing?

The version of perl I'm using is: This is perl, v5.10.1 (*) built for x86_64-linux-thread-multi

Fix: Upgrade to perl 5.22 or later.

Replies are listed 'Best First'.
Re: regex with capture groups and branch reset crashes perl
by choroba (Cardinal) on Dec 11, 2015 at 13:04 UTC
    What Perl version are you using? In 5.22.0, I'm getting no crashes, and the "wrong" output seems to be correct.
Re: regex with capture groups and branch reset crashes perl
by mr_mischief (Monsignor) on Dec 11, 2015 at 14:58 UTC

    I tried this on a few versions.

    These passed (OS X, MacPorts):

    • 5.16.3
    • 5.18.4
    • 5.20.3
    • 5.22.0
    These also passed (CentOS):
    • 5.14.4
    • 5.16.3
    These failed:
    • 5.10.1 on CentOS 6 (stopped at 4, no crash though)
    • 5.8.8 on CentOS 5 (got confused by '(?|' syntax and threw an error)

    This really seems to be something caused by an older version as others have said. I would recommend updating to a newer version and testing.

      • 5.8.8 on CentOS 5 (got confused by '(?|' syntax and threw an error)

      The '(?|' syntax was only introduced with Perl version 5.10.
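
      A script that has to run on mixed installations could guard the construct with a version check. The sketch below assumes only what is stated above, that `(?|` needs at least 5.10; the helper name `supports_branch_reset` is hypothetical. Passing this check does not mean the capture-numbering problem is absent, since 5.10.1 itself failed in the tests reported here.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the given numeric version (as in $])
# is at least 5.010, the release named above for '(?|'.
sub supports_branch_reset {
    my ($version) = @_;
    return $version >= 5.010 ? 1 : 0;
}

if ( supports_branch_reset($]) ) {
    # Keep the pattern in a string so that older perls never have to
    # parse '(?|' at compile time; qr// then compiles it at run time.
    my $source = '(?| (a) | (b) )';
    my $re     = qr/$source/x;
    print "branch reset available on perl $]\n" if 'b' =~ $re;
}
else {
    print "perl $] predates 5.10; avoid '(?|'\n";
}
```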



Re: regex with capture groups and branch reset crashes perl
by Anonymous Monk on Dec 11, 2015 at 14:00 UTC
    Of course, segfault means there is a bug. 5.022 has this in ./re/re_tests:
    # Used to crash, because the last branch was ignored when the parens # were counted: (?|(b)|()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()( +)()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()() +()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()( +)()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()() +()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()( +)()()()()()()()()()()(a))
    Which is kind of similar to your example. Looks like this bug was fixed at some point, I don't know when...
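
    To check a particular perl without taking down the calling script, the match can be run in a child process and the exit status inspected. This is a rough sketch, not a definitive test: the helper name `regex_crashes` is made up, it relies on Unix-style wait status in `$?`, and there is no guarantee it reproduces the original crash on every affected build.

```perl
use strict;
use warnings;

# Hypothetical helper: run the match in a child perl and return the
# number of the signal that killed it (0 if it exited normally).
sub regex_crashes {
    my ($string, $pattern_source) = @_;

    # Code for the child: compile the pattern with /x, match in list
    # context so that every capture group is read, then exit cleanly.
    my $child = 'my ($s, $p) = @ARGV; my @c = ( $s =~ qr/$p/x ); exit 0;';

    # $^X is the path of the perl running this script; the list form
    # of system() avoids any shell quoting of the pattern.
    system( $^X, '-e', $child, $string, $pattern_source );
    die "could not start child perl: $!\n" if $? == -1;

    return $? & 127;    # low 7 bits hold the terminating signal
}

my $signal = regex_crashes( 'atuvwxz',
    ' ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) (w) (x) ) ( z ) ' );
print $signal ? "child died with signal $signal\n" : "no crash\n";
```

    On Linux a segmentation fault would normally show up as signal 11.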