in reply to pattern match hangs on malformed UTF-8 input

OK, here's the case that failed:
$ printf "x\0227 " | tmp.pl

Compiling REx `^&'
size 4 first at 2
   1: BOL(2)
   2: EXACT <&>(4)
   4: END(0)
anchored `&' at 0 (checking anchored) anchored(BOL) minlen 1 
Compiling REx `\W'
size 2 first at 1
   1: NALNUM(2)
   2: END(0)
stclass `NALNUM' minlen 1 
Using REx substr: `::'
Guessing start of match, REx `\\/^\\/+$' against `/usr/local/lib/perl5/5.6.1//i686-linux/Devel/Peek.pm'...
Found floating substr `'$ at offset 52...
Does not contradict STCLASS...
Guessed: match at offset 0
(Snipped up to the part where the regex is actually run, because the compilation output is exactly the same apart from differing memory addresses.)
>x— <
SV = PV(0x80f4b84) at 0x80f4858
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x811d1e0 "x\227 "\0
  CUR = 3
  LEN = 80
Guessing start of match, REx `\d ' against `x— '...
Found anchored substr ` ' at offset 2...
This position contradicts STCLASS...
Looking for anchored substr starting at offset 2...
Found anchored substr ` ' at offset 2...
This position contradicts STCLASS...
Looking for anchored substr starting at offset 2...
Found anchored substr ` ' at offset 2...
This position contradicts STCLASS...
Looking for anchored substr starting at offset 2...
Found anchored substr ` ' at offset 2...
This position contradicts STCLASS...
Looking for anchored substr starting at offset 2...
Found anchored substr ` ' at offset 2...
This position contradicts STCLASS...
Looking for anchored substr starting at offset 2...
Found anchored substr ` ' at offset 2...
This position contradicts STCLASS...
Looking for anchored substr starting at offset 2...
Found anchored substr ` ' at offset 2...
This position contradicts STCLASS...
Looking for anchored substr starting at offset 2...
Found anchored substr ` ' at offset 2...
This position contradicts STCLASS...
Looking for anchored substr starting at offset 2...
Found anchored substr ` ' at offset 2...
This position contradicts STCLASS...
...etc. this repeats forever

Re: Re: pattern match hangs on malformed UTF-8 input
by hv (Prior) on May 09, 2003 at 18:10 UTC

    I can confirm that this hangs for me under perl-5.6.1, but runs fine under 5.8.0.

    The problem is a bug in the regexp optimiser, but I'm not sure off the top of my head which one: #4541 is a possibility. (You can browse the bugs database or look up specific bugs at http://rt.perl.org/perlbug.)

    Working around it doesn't appear to be easy: the best I could come up with was a convoluted attempt to convince the regexp engine that it doesn't know what class the first matched character can be:

    s/(?=\d)\D*\d //;
    which succeeds, and shouldn't be a lot slower than the original pattern would have been without optimisation.
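A quick way to convince yourself that the workaround matches the same text as the original is to run both on a few strings and compare. This is only a sketch: it assumes the original substitution was `s/\d //` (inferred from the `\d ` REx in the debug output), the helper name `same_result` and the sample inputs are made up for illustration, and it should be run on a perl where the unguarded pattern does not hang (5.8.0 did not for me).

```perl
use strict;
use warnings;

# Hypothetical helper: apply the assumed original substitution and the
# lookahead workaround to copies of the same string and compare the results.
sub same_result {
    my ($input) = @_;
    (my $plain   = $input) =~ s/\d //;             # assumed original pattern
    (my $guarded = $input) =~ s/(?=\d)\D*\d //;    # workaround from this reply
    return $plain eq $guarded;
}

# Illustrative inputs; "x\227 " is the byte string from the failing case.
my @inputs = ("x\227 ", "a1 b", "12 34 ", "no digits here");

for my $input (@inputs) {
    my $hex = join " ", map { sprintf "%02x", ord } split //, $input;
    printf "%-9s %s\n", (same_result($input) ? "same" : "DIFFERENT"), $hex;
}
```

The lookahead `(?=\d)` only succeeds where a digit starts, so `\D*` can never consume anything and the two patterns match at exactly the same positions. The extra `\D*` is just there to stop the optimiser from deciding up front what class the first character must be.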

    Hugo