in reply to This regex seems to have splattered non-greedy everywhere

It is a bug, but I'm not sure what's causing it. Running it through the debugger, when it finally gets to the (.+) part, it seems to be under the impression that it can only match one character. Changing the first half of the regex makes the symptom go away, but the underlying bug remains.

Update: By the way, I came up with a way to use split() for this kind of problem -- splitting on a pattern unless you're inside quotes (or the like). It's ugly, and probably not fit for production, but here it is:

while (<DATA>) {
  chomp;
  my $q = 0;
  # documented for the faint of heart
  my @fields = split m{
    '                # if we match a '
    (?{ $q = !$q })  # toggle $q
    (?!)             # and fail (don't split here)
  |                  # OR
    XX               # if we match XX
    (?(?{$q})        # and $q is true
      (?!)           # fail (don't split here)
    )                # otherwise it succeeds and splits
  }x;
  print "[", join("][", @fields), "]\n";
}
Try that on your data. It's sick.
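
For comparison, here is a plainer sketch of the same idea that needs no code blocks inside the regex: walk the line token by token, treating each quoted run as a single token. The name split_outside_quotes is a hypothetical helper, not something from the original question, and the sketch assumes balanced single quotes with no escaping. It also differs from split() in that it keeps trailing empty fields, and an unbalanced quote is treated as an ordinary character rather than suppressing every later split.

```perl
use strict;
use warnings;

# Hypothetical helper: split $line on the literal $sep, except where
# $sep appears inside a '...' run. Assumes balanced, unescaped quotes.
sub split_outside_quotes {
    my ($line, $sep) = @_;
    die "separator must be non-empty" unless length $sep;

    my @fields = ('');
    # Each iteration consumes one token: a whole quoted run, the
    # separator, or any single character.
    while ($line =~ /\G('[^']*'|\Q$sep\E|.)/gs) {
        my $tok = $1;
        if ($tok eq $sep) {
            push @fields, '';        # separator outside quotes: new field
        }
        else {
            $fields[-1] .= $tok;     # anything else extends the current field
        }
    }
    return @fields;
}

my @fields = split_outside_quotes("aXXb'cXXd'eXXf", "XX");
print "[", join("][", @fields), "]\n";   # prints [a][b'cXXd'e][f]
```

Because the pattern is a plain alternation with no embedded code, the same approach should carry over to other regex engines with little change.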

Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart

Re^2: This regex seems to have splattered non-greedy everywhere
by fizbin (Chaplain) on Aug 10, 2005 at 18:07 UTC
    That's some impressive abuse of code-eval-during-match, but you're right that it's not suitable for production, especially since the real production system is sadly using regular expressions in that other language, and not perl.

    And I'll note that the other language returned the matches I thought it should, although it took time on the order of 2**(size of trailing field). It started getting noticeable when we hit a case where the trailing field was 26 characters long (2**26 is about 67 million)...

    -- @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/
      Bug aside, your regex could use some refactoring. I'd suggest:
      m{ ( [^']*? (?: '[^']*' [^']*? )* ) XX | (.+) $ }x
      The unrolled loop of the first half and the $ anchor of the second half should control the performance.
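
A minimal sketch of how that pattern might be applied, assuming it is run repeatedly with /g over a chomped line and that each match yields one field (in $1 when the field is terminated by XX, in $2 for the trailing field). The name fields_of and the sample line are made up for illustration.

```perl
use strict;
use warnings;

# The suggested pattern: a field ending in XX (quoted runs are skipped
# as a unit by the unrolled loop), or the anchored trailing field.
my $field_re = qr{ ( [^']*? (?: '[^']*' [^']*? )* ) XX | (.+) $ }x;

# Hypothetical helper: collect every field of $line in order.
sub fields_of {
    my ($line) = @_;
    my @fields;
    while ($line =~ /$field_re/g) {
        # Only one of the two capture groups takes part in each match.
        push @fields, defined $1 ? $1 : $2;
    }
    return @fields;
}

print "[", join("][", fields_of("aXXb'cXXd'eXXf")), "]\n";   # prints [a][b'cXXd'e][f]
```

Every successful match consumes at least one character (either the XX or the .+), so the /g loop cannot stall on an empty match.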

      Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
      How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
        You realize that the unrolled loop is what I used as the equivalent version above, right? In fact, unrolling the loop like this (without the $ anchor) is exactly what I did to the Java code to make performance jump back to fast-enough-not-to-matter.
        -- @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/