y9o has asked for the wisdom of the Perl Monks concerning the following question:

This simple perl script hangs in 5.6.1 and 5.8 (in 5.8, without the "use utf8" line).

printf "x\0227 " | perl -e 'use utf8; while (<STDIN>) { s/\d //; }'
Has anyone run into this? Know a work-around? Seen a bug fix?

Note that I realize the input ('x\0227 ') is malformed, but shouldn't perl report "Malformed UTF-8" rather than just hanging?

Note also that there seems to be a rather specific case where this happens--the pattern must be a character class followed by a space. For example, s/[a] //; also hangs. The input must also have a trailing space.
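One possible guard, sketched here under assumptions rather than taken from the thread, is to check that each input line is well-formed UTF-8 before running the substitution. It relies on the core function utf8::decode, which exists from perl 5.8 onward and so would not help on 5.6.1. The helper name is_wellformed_utf8 is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $bytes is a well-formed UTF-8 octet
# sequence, 0 otherwise. utf8::decode works in place, so it is run on a
# copy to leave the caller's data untouched.
sub is_wellformed_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;
    return utf8::decode($copy) ? 1 : 0;
}

# "x\x97 " is the problem input (0x97 is a stray continuation byte);
# "x7 " is plain ASCII and therefore valid UTF-8.
for my $line ("x\x97 ", "x7 ") {
    if (is_wellformed_utf8($line)) {
        (my $out = $line) =~ s/\d //;
        print "processed line, result has ", length($out), " byte(s)\n";
    }
    else {
        warn "Malformed UTF-8 in input, skipping line\n";
    }
}
```

This only sidesteps the hang by never handing the malformed line to the regex engine; it does not fix the underlying problem.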

Re: pattern match hangs on malformed UTF-8 input
by diotalevi (Canon) on Apr 30, 2003 at 16:17 UTC

    But "x\227 " isn't utf8. I changed your snippet to printf "x\0227 " | perl -MDevel::Peek -e 'while (<STDIN>) { Dump($_) }' (with and without use utf8, on ActiveState 5.6.1 and Cygwin 5.8.0), and in no case is your input treated as utf8. Can you write a plain-perl version?

      Hmmm. I don't have an example that is pure perl. I suspect some of the problem stems from the fact that this data is being read in from a file. Regardless of whether Perl is treating the string as UTF-8, does it freeze when you run it? If so, this seems to be a problem--if a file ends with this byte sequence, then perl will hang....

        Change your test case to use Devel::Peek's Dump() routine and show the results from that. We'll know what data you've actually read then. As is, your code runs without any problems.

      I'm confused--are you saying that
      1) when you run this code, it doesn't hang, and
      2) when you run the code, it has different output?

      Can you post the output you get? What about the locale you are running in? Thanks

        I receive no errors and in general think you're doing something wrong or not giving me the whole story. While these results are from running on perl 5.6.1 on OpenBSD 3.2 using the default locale (C), I had identical behaviour when I used ActiveState 5.6.1 build 633 on Win2K and Cygwin compiled perl 5.8.0 on Win2K (default locale again). If you are the same person who posted the results from Devel::Peek with the GMG/'g' flag then *that* result is certainly odd given the code snippet.

        $ printf "x\0227 " | perl -e 'use utf8; while (<STDIN>) { s/\d //; }'
        $ printf "x\0227 " | perl -MDevel::Peek -e 'while (<STDIN>) { Dump($_) }'
        SV = PV(0x743c) at 0x7108
          REFCNT = 1
          FLAGS = (POK,pPOK)
          PV = 0x8180 "x\0227 "\0
          CUR = 4
          LEN = 80
        $ printf "x\0227 " | perl fo
        >x7 <
        SV = PV(0x743c) at 0x7108
          REFCNT = 1
          FLAGS = (POK,pPOK)
          PV = 0x8500 "x\0227 "\0
          CUR = 4
          LEN = 80
        hi
        $
      I'm telling you the whole story (though obviously some piece of information is missing).
      So, here's the output of 'perl -v'. I'm running on RedHat 7.3
      This is perl, v5.6.1 built for i686-linux
      Copyright 1987-2001, Larry Wall
      Perl may be copied only under the terms of either the Artistic License or the
      GNU General Public License, which may be found in the Perl 5 source kit.
      Complete documentation for Perl, including FAQ lists, should be found on
      this system using `man perl' or `perldoc perl'. If you have access to the
      Internet, point your browser at http://www.perl.com/, the Perl Home Page.

      The other thing I noticed is in the output you posted, perl sees '>x7 <', while in my output, perl sees '>x <'
      Not only that, but the PV is different: "x\0227 " vs "x\227 ". It seems like Perl is treating your input as octal 022 followed by the numeral 7 and a space, whereas Perl is treating my input as octal 227 followed by a space?
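The two Dump outputs are consistent with the escape \0227 being read in two different ways. One plausible cause, not confirmed in the thread, is that the two printf implementations consume a different number of octal digits after the backslash. The sketch below builds both byte strings directly in Perl, which avoids depending on the shell's printf at all. The helper name hex_bytes is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($s) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $s;
}

# Reading 1: the whole escape is one byte, octal 227 (hex 97).
my $one_byte = "x\x97 ";

# Reading 2: only \022 is consumed and the 7 stays a literal digit.
# Perl's own double-quoted "\0227" parses this way, because an octal
# escape takes at most three digits.
my $two_bytes = "x\0227 ";

printf "%d bytes: %s\n", length($one_byte),  hex_bytes($one_byte);   # 3 bytes: 78 97 20
printf "%d bytes: %s\n", length($two_bytes), hex_bytes($two_bytes);  # 4 bytes: 78 12 37 20
```

Only the second string contains a digit followed by a space, so only that input gives s/\d // something to match.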

      Uh-oh, I figured out the 'g' problem: I had 'use Diagnostics' in there for a while. I pasted my little script in and deleted lines that were commented out and apparently this one wasn't. So, here is the exact script and output (still freezes):
      $ more tmp.pl
      #!/usr/local/bin/perl -w
      use utf8;
      use Devel::Peek;
      while (<STDIN>) {
          print ">$_<\n";
          Dump($_);
          s/\d //;
          print "hi\n";
      }
      $ printf "x\0227 " | tmp.pl
      >x <
      SV = PV(0x80f4b84) at 0x80f4858
        REFCNT = 1
        FLAGS = (POK,pPOK)
        PV = 0x80ff148 "x\227 "\0
        CUR = 4
        LEN = 80

        Sorry about the hiatus. I tried your code and it works just fine. Perhaps you should look and see if something unusual has happened to your operating system - perhaps part of perl has just gone bad.

        >x7 <
        SV = PV(0x743c) at 0x7108
          REFCNT = 1
          FLAGS = (POK,pPOK)
          PV = 0x8500 "x\0227 "\0
          CUR = 5
          LEN = 80
        hi
Re: pattern match hangs on malformed UTF-8 input
by Anonymous Monk on May 08, 2003 at 18:08 UTC
    I've tried this on 5 servers with 4 different operating systems, including windows 2000. All perl 5.6.1. Granted, all the servers had perl installed by the same admin, but I installed the windows version myself. This is all too consistent to say perl has "just gone bad". Though I am surprised no one else can reproduce it.

      Now I'm really shooting in the dark. Stick a use re 'debug' at the top of your script and post the results. Maybe that'll tell us something.

Re: pattern match hangs on malformed UTF-8 input
by Anonymous Monk on May 09, 2003 at 14:43 UTC
    We may have something. Lots of output. I'll start with an example that does NOT hang (because I added an 'x'):
    $ printf "x\0227x " | tmp.pl
    
    Compiling REx `^&'
    size 4 first at 2
       1: BOL(2)
       2: EXACT <&>(4)
       4: END(0)
    anchored `&' at 0 (checking anchored) anchored(BOL) minlen 1 
    Compiling REx `\W'
    size 2 first at 1
       1: NALNUM(2)
       2: END(0)
    stclass `NALNUM' minlen 1 
    Using REx substr: `::'
    Guessing start of match, REx `\\/^\\/+$' against `/usr/local/lib/perl5/5.6.1//i686-linux/Devel/Peek.pm'...
    Found floating substr `'$ at offset 52...
    Does not contradict STCLASS...
    Guessed: match at offset 0
    Matching REx `\\/^\\/+$' against `/usr/local/lib/perl5/5.6.1//i686-linux/Devel/Peek.pm'
      Setting an EVAL scope, savestack=307
       0 <> </usr/local/l>    |  1:  ANYOF/\\
       1 </> <usr/local/l>    | 10:  PLUS
                               ANYOF[\0-.0-\-\377] can match 3 times out of 32767...
      Setting an EVAL scope, savestack=307
       4 </usr> </local/l>    | 20:    EOL
                                  failed...
                                failed...
      Setting an EVAL scope, savestack=307
       4 </usr> </local/l>    |  1:  ANYOF/\\
       5 </usr/> <local/l>    | 10:  PLUS
                               ANYOF[\0-.0-\-\377] can match 5 times out of 32767...
      Setting an EVAL scope, savestack=307
      10 <local> </lib/pe>    | 20:    EOL
                                  failed...
                                failed...
      Setting an EVAL scope, savestack=307
      10 <local> </lib/pe>    |  1:  ANYOF/\\
      11 <ocal/> <lib/per>    | 10:  PLUS
                               ANYOF[\0-.0-\-\377] can match 3 times out of 32767...
      Setting an EVAL scope, savestack=307
      14 <l/lib> </perl5/>    | 20:    EOL
                                  failed...
                                failed...
      Setting an EVAL scope, savestack=307
      14 <l/lib> </perl5/>    |  1:  ANYOF/\\
      15 </lib/> <perl5/5>    | 10:  PLUS
                               ANYOF[\0-.0-\-\377] can match 5 times out of 32767...
      Setting an EVAL scope, savestack=307
      20 <perl5> </5.6.1/>    | 20:    EOL
                                  failed...
                                failed...
      Setting an EVAL scope, savestack=307
      20 <perl5> </5.6.1/>    |  1:  ANYOF/\\
      21 <erl5/> <5.6.1//>    | 10:  PLUS
                               ANYOF[\0-.0-\-\377] can match 5 times out of 32767...
      Setting an EVAL scope, savestack=307
      26 <5.6.1> <//i686->    | 20:    EOL
                                  failed...
                                failed...
      Setting an EVAL scope, savestack=307
      26 <5.6.1> <//i686->    |  1:  ANYOF/\\
      27 <.6.1/> </i686-l>    | 10:  PLUS
                               ANYOF[\0-.0-\-\377] can match 0 times out of 32767...
      Setting an EVAL scope, savestack=307
                                failed...
      Setting an EVAL scope, savestack=307
      27 <.6.1/> </i686-l>    |  1:  ANYOF/\\
      28 <6.1//> <i686-li>    | 10:  PLUS
                               ANYOF[\0-.0-\-\377] can match 10 times out of 32767...
      Setting an EVAL scope, savestack=307
      38 <linux> </Devel/>    | 20:    EOL
                                  failed...
                                failed...
      Setting an EVAL scope, savestack=307
      38 <linux> </Devel/>    |  1:  ANYOF/\\
      39 <inux/> <Devel/P>    | 10:  PLUS
                               ANYOF[\0-.0-\-\377] can match 5 times out of 32767...
      Setting an EVAL scope, savestack=307
      44 <Devel> </Peek.p>    | 20:    EOL
                                  failed...
                                failed...
      Setting an EVAL scope, savestack=307
      44 <Devel> </Peek.p>    |  1:  ANYOF/\\
      45 <evel/> <Peek.pm>    | 10:  PLUS
                               ANYOF[\0-.0-\-\377] can match 7 times out of 32767...
      Setting an EVAL scope, savestack=307
      52 <evel/Peek.pm> <>    | 20:    EOL
      52 <evel/Peek.pm> <>    | 21:    END
    Match successful!
    Guessing start of match, REx `\\/^\\/+$' against `/usr/local/lib/perl5/5.6.1//i686-linux/Devel'...
    Found floating substr `'$ at offset 44...
    Does not contradict STCLASS...
    Guessed: match at offset 0
    Matching REx `\\/^\\/+$' against `/usr/local/lib/perl5/5.6.1//i686-linux/Devel'
      Setting an EVAL scope, savestack=304
       0 <> </usr/local/l>    |  1:  ANYOF/\\
       1 </> <usr/local/l>    | 10:  PLUS
                               ANYOF[\0-.0-\-\377] can match 3 times out of 32767...
      Setting an EVAL scope, savestack=304
       4 </usr> </local/l>    | 20:    EOL
                                  failed...
                                failed...
      Setting an EVAL scope, savestack=304
       4 </usr> </local/l>    |  1:  ANYOF/\\
       5 </usr/> <local/l>    | 10:  PLUS
                               ANYOF[\0-.0-\-\377] can match 5 times out of 32767...
      Setting an EVAL scope, savestack=304
      10 <local> </lib/pe>    | 20:    EOL
                                  failed...
                                failed...
      Setting an EVAL scope, savestack=304
      10 <local> </lib/pe>    |  1:  ANYOF/\\
      11 <ocal/> <lib/per>    | 10:  PLUS
                               ANYOF[\0-.0-\-\377] can match 3 times out of 32767...
      Setting an EVAL scope, savestack=304
      14 <l/lib> </perl5/>    | 20:    EOL
                                  failed...
                                failed...
      Setting an EVAL scope, savestack=304
      14 <l/lib> </perl5/>    |  1:  ANYOF/\\
      15 </lib/> <perl5/5>    | 10:  PLUS
                               ANYOF[\0-.0-\-\377] can match 5 times out of 32767...
      Setting an EVAL scope, savestack=304
      20 <perl5> </5.6.1/>    | 20:    EOL
                                  failed...
                                failed...
      Setting an EVAL scope, savestack=304
      20 <perl5> </5.6.1/>    |  1:  ANYOF/\\
      21 <erl5/> <5.6.1//>    | 10:  PLUS
                               ANYOF[\0-.0-\-\377] can match 5 times out of 32767...
      Setting an EVAL scope, savestack=304
      26 <5.6.1> <//i686->    | 20:    EOL
                                  failed...
                                failed...
      Setting an EVAL scope, savestack=304
      26 <5.6.1> <//i686->    |  1:  ANYOF/\\
      27 <.6.1/> </i686-l>    | 10:  PLUS
                               ANYOF[\0-.0-\-\377] can match 0 times out of 32767...
      Setting an EVAL scope, savestack=304
                                failed...
      Setting an EVAL scope, savestack=304
      27 <.6.1/> </i686-l>    |  1:  ANYOF/\\
      28 <6.1//> <i686-li>    | 10:  PLUS
                               ANYOF[\0-.0-\-\377] can match 10 times out of 32767...
      Setting an EVAL scope, savestack=304
      38 <-linux> </Devel>    | 20:    EOL
                                  failed...
                                failed...
      Setting an EVAL scope, savestack=304
      38 <-linux> </Devel>    |  1:  ANYOF/\\
      39 <-linux/> <Devel>    | 10:  PLUS
                               ANYOF[\0-.0-\-\377] can match 5 times out of 32767...
      Setting an EVAL scope, savestack=304
      44 <-linux/Devel> <>    | 20:    EOL
      44 <-linux/Devel> <>    | 21:    END
    Match successful!
    Guessing start of match, REx `(\.\w+)?(;\d*)?$' against `/usr/local/lib/perl5/5.6.1//i686-linux/auto/Devel/Peek/Peek....'...
    Found floating substr `'$ at offset 62...
    Guessed: match at offset 0
    Matching REx `(\.\w+)?(;\d*)?$' against `/usr/local/lib/perl5/5.6.1//i686-linux/auto/Devel/Peek/Peek....'
      Setting an EVAL scope, savestack=308
       0 <> </usr/local/l>    |  1:  CURLYX[0] {0,1}
       0 <> </usr/local/l>    | 11:    WHILEM
                                  0 out of 0..1  cc=bfffeab0
      Setting an EVAL scope, savestack=313
       0 <> </usr/local/l>    |  3:      OPEN1
       0 <> </usr/local/l>    |  5:      EXACT <.>
                                    failed...
         restoring \1..\2 to undef
                                  failed, try continuation...
       0 <> </usr/local/l>    | 12:      NOTHING
       0 <> </usr/local/l>    | 13:      CURLYX1 {0,1}
       0 <> </usr/local/l>    | 23:        WHILEM
                                      0 out of 0..1  cc=bfffe690
      Setting an EVAL scope, savestack=313
       0 <> </usr/local/l>    | 15:          OPEN2
       0 <> </usr/local/l>    | 17:          EXACT <;>
                                        failed...
         restoring \1..\2 to undef
                                      failed, try continuation...
       0 <> </usr/local/l>    | 24:          NOTHING
       0 <> </usr/local/l>    | 25:          EOL
                                        failed...
                                      failed...
                                    failed...
                                  failed...
                                failed...
    
    (Snipped lots of similar looking stuff)
      Setting an EVAL scope, savestack=313
      58 <Peek/Pee> <k.so>    | 15:          OPEN2
      58 <Peek/Pee> <k.so>    | 17:          EXACT <;>
                                        failed...
         restoring \1..\2 to undef
                                      failed, try continuation...
      58 <Peek/Pee> <k.so>    | 24:          NOTHING
      58 <Peek/Pee> <k.so>    | 25:          EOL
                                        failed...
                                      failed...
                                    failed...
                                  failed...
                                failed...
      Setting an EVAL scope, savestack=308
      59 <Peek/Peek> <.so>    |  1:  CURLYX[0] {0,1}
      59 <Peek/Peek> <.so>    | 11:    WHILEM
                                  0 out of 0..1  cc=bfffeab0
      Setting an EVAL scope, savestack=313
      59 <Peek/Peek> <.so>    |  3:      OPEN1
      59 <Peek/Peek> <.so>    |  5:      EXACT <.>
      60 <Peek/Peek.> <so>    |  7:      PLUS
                               ALNUM can match 2 times out of 32767...
      Setting an EVAL scope, savestack=313
      62 <Peek/Peek.so> <>    |  9:        CLOSE1
      62 <Peek/Peek.so> <>    | 11:        WHILEM
                                      1 out of 0..1  cc=bfffeab0
      62 <Peek/Peek.so> <>    | 12:          NOTHING
      62 <Peek/Peek.so> <>    | 13:          CURLYX1 {0,1}
      62 <Peek/Peek.so> <>    | 23:            WHILEM
                                          0 out of 0..1  cc=bfffe270
      Setting an EVAL scope, savestack=318
      62 <Peek/Peek.so> <>    | 15:              OPEN2
      62 <Peek/Peek.so> <>    | 17:              EXACT <;>
                                            failed...
         restoring \2..\2 to undef
                                          failed, try continuation...
      62 <Peek/Peek.so> <>    | 24:              NOTHING
      62 <Peek/Peek.so> <>    | 25:              EOL
      62 <Peek/Peek.so> <>    | 26:              END
    Match successful!
    Matching REx `\W' against `boot_Devel::Peek'
      Setting an EVAL scope, savestack=310
      10 <_Devel> <::Peek>    |  1:  NALNUM
      11 <_Devel:> <:Peek>    |  2:  END
    Match successful!
    Matching REx `\W' against `:Peek'
      Setting an EVAL scope, savestack=310
      11 <_Devel_> <:Peek>    |  1:  NALNUM
      12 <_Devel_:> <Peek>    |  2:  END
    Match successful!
    Matching REx `\W' against `Peek'
    Contradicts stclass...
    Match failed
    Matching REx `\W' against `Dump'
    Contradicts stclass...
    Match failed
    Matching REx `\W' against `mstat'
    Contradicts stclass...
    Match failed
    Matching REx `\W' against `DeadCode'
    Contradicts stclass...
    Match failed
    Matching REx `\W' against `DumpArray'
    Contradicts stclass...
    Match failed
    Matching REx `\W' against `DumpWithOP'
    Contradicts stclass...
    Match failed
    Matching REx `\W' against `DumpProg'
    Contradicts stclass...
    Match failed
    Matching REx `\W' against `fill_mstats'
    Contradicts stclass...
    Match failed
    Matching REx `\W' against `mstats_fillhash'
    Contradicts stclass...
    Match failed
    Matching REx `\W' against `mstats2hash'
    Contradicts stclass...
    Match failed
    Compiling REx `\d '
    Compiling REx `::'
    size 3 first at 1
       1: EXACT <::>(3)
       3: END(0)
    anchored `::' at 0 (checking anchored isall) minlen 2 
    Compiling REx `^(Isn|To)(A-Z.*)'
    size 36 first at 2
       1: BOL(2)
       2: OPEN1(4)
       4:   BRANCH(16)
       5:     EXACT (7)
       7:     ANYOFns(19)
      16:   BRANCH(19)
      17:     EXACT <To>(19)
      19: CLOSE1(21)
      21: OPEN2(23)
      23:   ANYOFA-Z(32)
      32:   STAR(34)
      33:     REG_ANY(0)
      34: CLOSE2(36)
      36: END(0)
    anchored(BOL) minlen 3 
    Compiling REx `^'
    size 2 first at 2
       1: MBOL(2)
       2: END(0)
    stclass `END' anchored(MBOL) minlen 0 
    Matching REx `\W' against `confess'
    Contradicts stclass...
    Match failed
    Matching REx `\W' against `croak'
    Contradicts stclass...
    Match failed
    Matching REx `\W' against `carp'
    Contradicts stclass...
    Match failed
    Compiling REx `^(^=+)='
    size 18 first at 2
    synthetic stclass `ANYOF\0-<>-\377'.
       1: BOL(2)
       2: OPEN1(4)
       4:   PLUS(14)
       5:     ANYOF\0-<>-\377(0)
      14: CLOSE1(16)
      16: EXACT <=>(18)
      18: END(0)
    floating `=' at 1..2147483647 (checking floating) stclass `ANYOF\0-<>-\377' anchored(BOL) minlen 2 
    Compiling REx `^^0-9a-fA-F'
    size 11 first at 2
       1: BOL(2)
       2: ANYOF\0-/:-@G-`g-\377(11)
      11: END(0)
    stclass `ANYOF\0-/:-@G-`g-\377' anchored(BOL) minlen 1 
    Compiling REx `^(0-9a-fA-F+)'
    size 16 first at 2
    synthetic stclass `ANYOF0-9A-Fa-f'.
       1: BOL(2)
       2: OPEN1(4)
       4:   PLUS(14)
       5:     ANYOF0-9A-Fa-f(0)
      14: CLOSE1(16)
      16: END(0)
    stclass `ANYOF0-9A-Fa-f' anchored(BOL) minlen 1 
    Compiling REx `\tXXXX$'
    size 5 first at 1
       1: EXACT <	XXXX>(4)
       4: MEOL(5)
       5: END(0)
    anchored `	XXXX'$ at 0 (checking anchored isall) minlen 5 
    Compiling REx `^(0-9a-fA-F+)(?:\t(0-9a-fA-F+)?)(?:\t(0-9a-fA-F+))?'
    size 56 first at 2
    synthetic stclass `ANYOF0-9A-Fa-f'.
       1: MBOL(2)
       2: OPEN1(4)
       4:   PLUS(14)
       5:     ANYOF0-9A-Fa-f(0)
      14: CLOSE1(16)
      16: EXACT <	>(18)
      18: CURLYX1 {0,1}(35)
      20:   OPEN2(22)
      22:     PLUS(32)
      23:       ANYOF0-9A-Fa-f(0)
      32:   CLOSE2(34)
      34:   WHILEM(0)
      35: NOTHING(36)
      36: CURLYX2 {0,1}(55)
      38:   EXACT <	>(40)
      40:   OPEN3(42)
      42:     PLUS(52)
      43:       ANYOF0-9A-Fa-f(0)
      52:   CLOSE3(54)
      54:   WHILEM(0)
      55: NOTHING(56)
      56: END(0)
    floating `	' at 1..2147483647 (checking floating) stclass `ANYOF0-9A-Fa-f' anchored(MBOL) minlen 2 
    Compiling REx `^(^0-9a-fA-F\n)(.*)'
    size 21 first at 2
    synthetic stclass `ANYOF\0-\11\13-/:-@G-`g-\377'.
       1: MBOL(2)
       2: OPEN1(4)
       4:   ANYOF\0-\11\13-/:-@G-`g-\377(13)
      13: CLOSE1(15)
      15: OPEN2(17)
      17:   STAR(19)
      18:     REG_ANY(0)
      19: CLOSE2(21)
      21: END(0)
    stclass `ANYOF\0-\11\13-/:-@G-`g-\377' anchored(MBOL) minlen 1 
    Compiling REx `-+!'
    size 10 first at 1
       1: ANYOF!+\-(10)
      10: END(0)
    stclass `ANYOF!+\-' minlen 1 
    Compiling REx `::'
    size 3 first at 1
       1: EXACT <::>(3)
       3: END(0)
    anchored `::' at 0 (checking anchored isall) minlen 2 
    Compiling REx `^(0-9a-fA-F+)(?:\t(0-9a-fA-F+)?)(?:\t(0-9a-fA-F+))?'
    size 56 first at 2
    synthetic stclass `ANYOF0-9A-Fa-f'.
       1: MBOL(2)
       2: OPEN1(4)
       4:   PLUS(14)
       5:     ANYOF0-9A-Fa-f(0)
      14: CLOSE1(16)
      16: EXACT <	>(18)
      18: CURLYX1 {0,1}(35)
      20:   OPEN2(22)
      22:     PLUS(32)
      23:       ANYOF0-9A-Fa-f(0)
      32:   CLOSE2(34)
      34:   WHILEM(0)
      35: NOTHING(36)
      36: CURLYX2 {0,1}(55)
      38:   EXACT <	>(40)
      40:   OPEN3(42)
      42:     PLUS(52)
      43:       ANYOF0-9A-Fa-f(0)
      52:   CLOSE3(54)
      54:   WHILEM(0)
      55: NOTHING(56)
      56: END(0)
    floating `	' at 1..2147483647 (checking floating) stclass `ANYOF0-9A-Fa-f' anchored(MBOL) minlen 2 
    Compiling REx `^(0-9a-fA-F+)(?:\t(0-9a-fA-F+))?'
    size 36 first at 2
    synthetic stclass `ANYOF0-9A-Fa-f'.
       1: MBOL(2)
       2: OPEN1(4)
       4:   PLUS(14)
       5:     ANYOF0-9A-Fa-f(0)
      14: CLOSE1(16)
      16: CURLYX1 {0,1}(35)
      18:   EXACT <	>(20)
      20:   OPEN2(22)
      22:     PLUS(32)
      23:       ANYOF0-9A-Fa-f(0)
      32:   CLOSE2(34)
      34:   WHILEM(0)
      35: NOTHING(36)
      36: END(0)
    stclass `ANYOF0-9A-Fa-f' anchored(MBOL) minlen 1 
    Compiling REx `^(-+!)(.*)'
    size 21 first at 2
    synthetic stclass `ANYOF!+\-'.
       1: MBOL(2)
       2: OPEN1(4)
       4:   ANYOF!+\-(13)
      13: CLOSE1(15)
      15: OPEN2(17)
      17:   STAR(19)
      18:     REG_ANY(0)
      19: CLOSE2(21)
      21: END(0)
    stclass `ANYOF!+\-' anchored(MBOL) minlen 1 
    size 4 first at 1
       1: DIGITUTF8(2)
       2: EXACT < >(4)
       4: END(0)
    anchored ` ' at 1 (checking anchored) stclass `DIGITUTF8' minlen 2 
    >x—x <
    SV = PV(0x80f4b84) at 0x80f4858
      REFCNT = 1
      FLAGS = (POK,pPOK)
      PV = 0x811d1e0 "x\227x "\0
      CUR = 4
      LEN = 80
    Guessing start of match, REx `\d ' against `x—x '...
    Found anchored substr ` ' at offset 3...
    Starting position does not contradict /^/m...
    This position contradicts STCLASS...
    Looking for anchored substr starting at offset 4...
    Did not find anchored substr ` '...
    Match rejected by optimizer
    hi
    Freeing REx: `\d '
    
Re: pattern match hangs on malformed UTF-8 input
by Anonymous Monk on May 09, 2003 at 14:44 UTC
    OK, here's the case that failed:
    $ printf "x\0227 " | tmp.pl
    
    Compiling REx `^&'
    size 4 first at 2
       1: BOL(2)
       2: EXACT <&>(4)
       4: END(0)
    anchored `&' at 0 (checking anchored) anchored(BOL) minlen 1 
    Compiling REx `\W'
    size 2 first at 1
       1: NALNUM(2)
       2: END(0)
    stclass `NALNUM' minlen 1 
    Using REx substr: `::'
    Guessing start of match, REx `\\/^\\/+$' against `/usr/local/lib/perl5/5.6.1//i686-linux/Devel/Peek.pm'...
    Found floating substr `'$ at offset 52...
    Does not contradict STCLASS...
    Guessed: match at offset 0
    
    (Snipped up to the part where the regex is actually run, because the compilation is exactly the same apart from differing memory addresses)
    >x— <
    SV = PV(0x80f4b84) at 0x80f4858
      REFCNT = 1
      FLAGS = (POK,pPOK)
      PV = 0x811d1e0 "x\227 "\0
      CUR = 3
      LEN = 80
    Guessing start of match, REx `\d ' against `x— '...
    Found anchored substr ` ' at offset 2...
    This position contradicts STCLASS...
    Looking for anchored substr starting at offset 2...
    Found anchored substr ` ' at offset 2...
    This position contradicts STCLASS...
    Looking for anchored substr starting at offset 2...
    Found anchored substr ` ' at offset 2...
    This position contradicts STCLASS...
    Looking for anchored substr starting at offset 2...
    Found anchored substr ` ' at offset 2...
    This position contradicts STCLASS...
    Looking for anchored substr starting at offset 2...
    Found anchored substr ` ' at offset 2...
    This position contradicts STCLASS...
    Looking for anchored substr starting at offset 2...
    Found anchored substr ` ' at offset 2...
    This position contradicts STCLASS...
    Looking for anchored substr starting at offset 2...
    Found anchored substr ` ' at offset 2...
    This position contradicts STCLASS...
    Looking for anchored substr starting at offset 2...
    Found anchored substr ` ' at offset 2...
    This position contradicts STCLASS...
    Looking for anchored substr starting at offset 2...
    Found anchored substr ` ' at offset 2...
    This position contradicts STCLASS...
    
    ...etc. this repeats forever

      I can confirm that this hangs for me under perl-5.6.1, but runs fine under 5.8.0.

      The problem is a bug in the regexp optimiser, but I'm not sure off the top of my head which one: #4541 is a possibility. (You can browse the bugs database or look up specific bugs at http://rt.perl.org/perlbug.)

      Working around it doesn't appear to be easy: the best I could come up with was a convoluted attempt to convince the regexp engine that it doesn't know what class the first matched character can be:

      s/(?=\d)\D*\d //;
      which succeeds, and shouldn't be a lot slower than the original pattern would have been without optimisation.

      Hugo
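As a quick sanity check on the workaround above (a sketch, not part of Hugo's reply), the two substitutions can be compared on ordinary inputs. The lookahead (?=\d) forces the match to start on a digit, so \D* can only match the empty string and the rewritten pattern should remove exactly what \d  removes. The helper names strip_plain and strip_workaround are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helpers wrapping the original pattern and the workaround.
sub strip_plain {
    my ($s) = @_;
    $s =~ s/\d //;
    return $s;
}

sub strip_workaround {
    my ($s) = @_;
    $s =~ s/(?=\d)\D*\d //;
    return $s;
}

# Compare both substitutions on a few well-formed sample strings.
for my $in ("abc 1 def", "x7 ", "no digits here", "12 34 ") {
    my $plain = strip_plain($in);
    my $alt   = strip_workaround($in);
    printf "%-18s => %-18s %s\n", "'$in'", "'$alt'",
        ($plain eq $alt ? 'same as original pattern' : 'DIFFERS');
}
```

This checks equivalence on well-formed input only. Whether the workaround avoids the hang still has to be tested on an affected perl such as 5.6.1.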
Re: pattern match hangs on malformed UTF-8 input
by Anonymous Monk on May 09, 2003 at 14:48 UTC
    PS:

    Thanks for your help.

    I noticed that the printout of the input data ('>x-- <') is now slightly different; I think this is just because I cut-and-pasted the output into a different program while composing the message.