in reply to pattern match hangs on malformed UTF-8 input

But "x\227 " isn't utf8. I changed your snippet to printf "x\0227 " | perl -MDevel::Peek -e 'while (<STDIN>) { Dump($_) }' (with and without use utf8, on ActiveState 5.6.1 and Cygwin 5.8.0) and in no case is your input treated like utf8. Can you write a plain-perl version?

Re: Re: pattern match hangs on malformed UTF-8 input
by y9o (Initiate) on Apr 30, 2003 at 18:00 UTC
    Hmmm. I don't have an example that is pure perl. I suspect some of the problem stems from the fact that this data is being read in from a file. Regardless of whether Perl is treating the string as UTF-8, does it freeze when you run it? If so, this seems to be a problem--if a file ends with this byte sequence, then perl will hang....

      Change your test case to use Devel::Peek's Dump() routine and show the results from that. We'll know what data you've actually read then. As is, your code runs without any problems.

        Here's some new code. Same thing, only Dump and print are used.
        You can change STDIN to a file handle opened on a file. As long as the last two characters are \227 (octal) and " ", "hi" is never printed.
        use utf8;
        use Devel::Peek;
        while (<STDIN>) {
            print ">$_<\n";
            Dump($_);
            s/\d //;
            print "hi\n";
        }
        and here's the output:
        >x <
        SV = PVMG(0x8101240) at 0x80f4858
          REFCNT = 1
          FLAGS = (SMG,POK,pPOK)
          IV = 0
          NV = 0
          PV = 0x80ff368 "x\227 "\0
          CUR = 3
          LEN = 80
          MAGIC = 0x81429f8
            MG_VIRTUAL = &PL_vtbl_mglob
            MG_TYPE = 'g'
            MG_LEN = -1
        (...and the script doesn't exit)
Re: Re: pattern match hangs on malformed UTF-8 input
by y9o (Initiate) on Apr 30, 2003 at 22:08 UTC
    I'm confused--are you saying that
    1) when you run this code, it doesn't hang, and
    2) when you run the code, it has different output?

    Can you post the output you get? What about the locale you are running in? Thanks

      I receive no errors and in general think you're doing something wrong or not giving me the whole story. While these results are from running on perl 5.6.1 on OpenBSD 3.2 using the default locale (C), I had identical behaviour when I used ActiveState 5.6.1 build 633 on Win2K and Cygwin compiled perl 5.8.0 on Win2K (default locale again). If you are the same person who posted the results from Devel::Peek with the GMG/'g' flag then *that* result is certainly odd given the code snippet.

      $ printf "x\0227 " | perl -e 'use utf8; while (<STDIN>) { s/\d //; }'
      $ printf "x\0227 " | perl -MDevel::Peek -e 'while (<STDIN>) { Dump($_) }'
      SV = PV(0x743c) at 0x7108
        REFCNT = 1
        FLAGS = (POK,pPOK)
        PV = 0x8180 "x\0227 "\0
        CUR = 4
        LEN = 80
      $ printf "x\0227 " | perl fo
      >x7 <
      SV = PV(0x743c) at 0x7108
        REFCNT = 1
        FLAGS = (POK,pPOK)
        PV = 0x8500 "x\0227 "\0
        CUR = 4
        LEN = 80
      hi
      $
Re: Re: pattern match hangs on malformed UTF-8 input
by y9o (Initiate) on May 01, 2003 at 13:45 UTC
    I'm telling you the whole story (though obviously some piece of information is missing...).
    So, here's the output of 'perl -v'. I'm running on RedHat 7.3
    This is perl, v5.6.1 built for i686-linux
    Copyright 1987-2001, Larry Wall
    Perl may be copied only under the terms of either the Artistic License or the
    GNU General Public License, which may be found in the Perl 5 source kit.
    Complete documentation for Perl, including FAQ lists, should be found on
    this system using `man perl' or `perldoc perl'. If you have access to the
    Internet, point your browser at http://www.perl.com/, the Perl Home Page.

    The other thing I noticed is in the output you posted, perl sees '>x7 <', while in my output, perl sees '>x <'
    Not only that, but the PV is different: "x\0227 " vs "x\227 ". It seems like Perl is treating your input as octal 022 followed by the numeral 7 and a space, whereas Perl is treating my input as octal 227 followed by a space?
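One way to check which bytes actually reach perl is to dump them before perl is involved at all. The sketch below assumes a POSIX-style shell with `wc` and `od` available; the variable name `count` is made up for the example. It uses the three-digit form `\227`, which POSIX defines for the printf format string, rather than the four-digit `\0227`, whose handling appears to differ between the two machines in this thread.

```shell
# A three-digit octal escape in the printf format yields one byte,
# so "x", \227 and the space should come to 3 bytes in total.
count=$(printf 'x\227 ' | wc -c | tr -d ' ')
echo "bytes: $count"

# Show each byte in octal; x is 170, the space is 040.
printf 'x\227 ' | od -An -to1

# For comparison, the four-digit form from this thread. How many bytes
# it produces depends on the printf implementation, so just print it.
printf "x\0227 " | wc -c
```

If the last command reports 4 bytes, the local printf is reading `\022` followed by a literal 7, which would match the "x\0227 " and `>x7 <` output posted above rather than the "x\227 " case.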

    Uh-oh, I figured out the 'g' problem: I had 'use diagnostics' in there for a while. I pasted my little script in and deleted lines that were commented out and apparently this one wasn't. So, here is the exact script and output (still freezes):
    $ more tmp.pl
    #!/usr/local/bin/perl -w
    use utf8;
    use Devel::Peek;
    while (<STDIN>) {
        print ">$_<\n";
        Dump($_);
        s/\d //;
        print "hi\n";
    }
    $ printf "x\0227 " | tmp.pl
    >x <
    SV = PV(0x80f4b84) at 0x80f4858
      REFCNT = 1
      FLAGS = (POK,pPOK)
      PV = 0x80ff148 "x\227 "\0
      CUR = 4
      LEN = 80

      Sorry about the hiatus. I tried your code and it works just fine. Perhaps you should look and see if something unusual has happened to your operating system - perhaps part of perl has just gone bad.

      >x7 <
      SV = PV(0x743c) at 0x7108
        REFCNT = 1
        FLAGS = (POK,pPOK)
        PV = 0x8500 "x\0227 "\0
        CUR = 5
        LEN = 80
      hi