Beechbone has asked for the wisdom of the Perl Monks concerning the following question:

I've got a regex that won't match under "use utf8". It works fine without utf8, so I looked at the "use re 'debug'" output to understand the difference and the reason. But I don't see it. Could someone please explain?

This is our Perl:

Summary of my perl5 (revision 5.0 version 6 subversion 1) configuration:
  Platform:
    osname=linux, osvers=2.4.18-3, archname=i386-linux
    uname='linux grcdg028 2.4.18-3 #1 thu apr 18 07:37:53 edt 2002 i686 unknown '
    config_args='-des -Doptimize=-O2 -march=i386 -mcpu=i686 -Dcc=gcc -Dcf_by=Red Hat, Inc. -Dcccdlflags=-fPIC -Dinstallprefix= sr -Dprefix=/usr -Darchname=i386-linux -Dvendorprefix=/usr -Dsiteprefix=/usr -Uusethreads -Uuseithreads -Uuselargefiles -Dd_do id -Dd_semctl_semun -Di_db -Di_ndbm -Di_gdbm -Di_shadow -Di_syslog -Dman3ext=3pm'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=undef d_sfio=undef uselargefiles=undef usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
  Compiler:
    cc='gcc', ccflags ='-fno-strict-aliasing -I/usr/local/include',
    optimize='-O2 -march=i386 -mcpu=i686',
    cppflags='-fno-strict-aliasing -I/usr/local/include'
    ccversion='', gccversion='2.96 20000731 (Red Hat Linux 7.3 2.96-110)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=4
    alignbytes=4, usemymalloc=n, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -ldl -lm -lc -lcrypt -lutil
    perllibs=-lnsl -ldl -lm -lc -lcrypt -lutil
    libc=/lib/libc-2.2.5.so, so=so, useshrplib=false, libperl=libperl.a
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'

Characteristics of this binary (from libperl):
  Compile-time options:
  Built under linux
  Compiled at Sep 3 2003 17:48:23
  @INC:
    /usr/lib/perl5/5.6.1/i386-linux
    /usr/lib/perl5/5.6.1
    /usr/lib/perl5/site_perl/5.6.1/i386-linux
    /usr/lib/perl5/site_perl/5.6.1
    /usr/lib/perl5/site_perl
    /usr/lib/perl5/vendor_perl/5.6.1/i386-linux
    /usr/lib/perl5/vendor_perl/5.6.1
    /usr/lib/perl5/vendor_perl
    .

This is the regex and the debug output:
Compiling REx `^((?x-ism: [^\?]*? ))/((?x-ism: [^/\?]*? ))$'
size 21 first at 2
   1: BOL(2)
   2: OPEN1(4)
   4: MINMOD(5)
   5: STAR(8)
   6: ANYOFUTF8[^?...003f 003f](0)
   8: CLOSE1(10)
  10: EXACT </>(12)
  12: OPEN2(14)
  14: MINMOD(15)
  15: STAR(18)
  16: ANYOFUTF8[^/?...002f 002f 003f 003f](0)
  18: CLOSE2(20)
  20: EOL(21)
  21: END(0)
floating `/' at 0..2147483647 (checking floating) anchored(BOL) minlen 1
Guessing start of match, REx `^((?x-ism: [^\?]*? ))/((?x-ism: [^/\?]*? ))$' against `pf1ad1%/pf2%2Fad2/pdad3/pfad4/filename'..
Found floating substr `/' at offset 7...
Guessed: match at offset 0
Matching REx `^((?x-ism: [^\?]*? ))/((?x-ism: [^/\?]*? ))$' against `pf1ad1%/pf2%2Fad2/pdad3/pfad4/filename'
  Setting an EVAL scope, savestack=9
   0 <> <pf1ad1%/pf2%>    |  1: BOL
   0 <> <pf1ad1%/pf2%>    |  2: OPEN1
   0 <> <pf1ad1%/pf2%>    |  4: MINMOD
   0 <> <pf1ad1%/pf2%>    |  5: STAR
  Setting an EVAL scope, savestack=9
   0 <> <pf1ad1%/pf2%>    |  8: CLOSE1
   0 <> <pf1ad1%/pf2%>    | 10: EXACT </>
                               failed...
                               ANYOFUTF8[^?...003f 003f] can match 38 times out of 1...
  38 <ad4/filename> <>    |  8: CLOSE1
  38 <ad4/filename> <>    | 10: EXACT </>
                               failed...
                               ANYOFUTF8[^?...003f 003f] can match 0 times out of 1...
                               failed...
Match failed

Thank you...

Replies are listed 'Best First'.
Re: regex won't match with utf8 enabled...
by kvale (Monsignor) on Sep 24, 2003 at 17:06 UTC
    I cannot comment on perl 5.6.1 behavior, as I don't have a copy. But this matches fine on perl 5.8.0:
    use utf8;
    use warnings;
    print "matched\n" if 'pf1ad1%/pf2%2Fad2/pdad3/pfad4/filename' =~ m|^((?x-ism: [^\?]*? ))/((?x-ism: [^/\?]*? ))$|;
    so it is possible that you have a bug in 5.6.1.
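
    To compare the two interpreters side by side, a minimal sketch along the following lines might help, assuming the same script is run once under each perl. It prints the running version from $] next to the match result and the two captures. The variable names are illustrative and do not come from the original post.

```perl
use strict;
use warnings;
use utf8;

# Test string and pattern copied from the thread.
my $path = 'pf1ad1%/pf2%2Fad2/pdad3/pfad4/filename';
my @caps = $path =~ m|^((?x-ism: [^\?]*? ))/((?x-ism: [^/\?]*? ))$|;

# $] holds the version of the running interpreter (e.g. 5.006001 for 5.6.1).
if (@caps) {
    printf "perl %s: matched, \$1='%s' \$2='%s'\n", $], @caps;
}
else {
    printf "perl %s: no match\n", $];
}
```

    Running it with and without the "use utf8" line on each machine should show whether the pragma alone changes the result.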

    One other note: you don't need to escape '?' inside a character class. This works just as well:

    use utf8;
    use warnings;
    print "matched\n" if 'pf1ad1%/pf2%2Fad2/pdad3/pfad4/filename' =~ m|^((?x-ism: [^?]*? ))/((?x-ism: [^/?]*? ))$|;
    -Mark
      Oops, sorry, I forgot to mention: I tested it under 5.8.0 myself. We all have 5.8.0 on our laptops, but the production machine will run 5.6.1 later on. And it seems nobody tests his code very often on the development environment... ;-)

      About the escaped '?': I know, but this code has to be maintainable, so I prefer not to make someone guess whether that ? is a literal ? or something special. That's also the reason for the two qr// which make up this regex. All regexes are simple and live in qr//, so no quoting subtleties can arise...
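
      As a rough sketch of that two-qr// construction: the names $dir_re, $file_re and $path_re below are hypothetical, since the thread does not show the original variables. Perls before 5.14 stringify such a qr//x as (?x-ism:...), which is the form visible in the debug output; newer perls print (?^x:...) instead.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical building blocks, each kept simple and compiled on its own.
my $dir_re  = qr{ [^\?]*? }x;     # any run of characters except '?'
my $file_re = qr{ [^/\?]*? }x;    # last path component: no '/' and no '?'

# Interpolating the pieces yields the combined pattern from the question.
my $path_re = qr{^($dir_re)/($file_re)$};

my $path = 'pf1ad1%/pf2%2Fad2/pdad3/pfad4/filename';
my ($dir, $file) = $path =~ $path_re;

if (defined $file) {
    print "dir=$dir file=$file\n";
}
else {
    print "no match\n";
}
```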

      BTW: Thanks