Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I thought this should be fairly straightforward. This isn't my first time using regular expressions like this, but it is the first time I've actually wanted to find this type.

I specifically need to find any number (any size) which precedes an asterisk.
if ( /\d{1,}\*/ ) { print "$_\n"; }
But it does not.
It tells me:
Malformed UTF-8 character (unexpected continuation byte 0xb2, with no preceding start byte) in pattern match (m//) at ./script.pl line 13, <FILE> line xxxx

I tried several variations using parentheses and whatnot, but I get the same error every time.

Any help is appreciated.

Replies are listed 'Best First'.
Re: Regex (find an * after a digit)
by NetWallah (Canon) on Feb 06, 2004 at 03:36 UTC
    Googling for the error message reveals this link, which states:

    If you have a newer Perl (5.8 definitely, maybe 5.6 too) and are using a UTF-8 locale, you will get many warnings from perl about "Malformed UTF-8 character" warnings. This is because the HTML output currently uses latin1, which doesn't validate as UTF-8 (unless I am mistaken, this would occur for any input that didn't validate against the locale's encoding even without a UTF-8 locale).

    This problem can be worked around by telling perl to treat the input data as byte data rather than character data by adding the "use bytes;" pragma.

    "When you are faced with a dilemma, might as well make dilemmanade. "

      I *WAS* using "=" instead of "=~" like an idiot. Didn't help much though.

      "use bytes" solved the problem. Strange thing is, this is all on a terminal (no HTML) so not sure why it's doing this. The ONLY thing I can think of different is that the machine w/ the dodgy output has apache running on it, but I'm running this script off to the side.

      Thanks
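On the "=" versus "=~" slip mentioned above, here is a minimal sketch of the difference. The sub names with_assignment and with_binding are made up for illustration and are not from the original script.

```perl
use strict;
use warnings;

# Assignment: the match runs against the current $_, and its
# true/false result then overwrites $_.
sub with_assignment {
    local $_ = shift;
    $_ = /\d+\*/;
    return $_;
}

# Binding: $_ is only tested and keeps its contents.
sub with_binding {
    local $_ = shift;
    my $matched = $_ =~ /\d+\*/;
    return ($matched ? 1 : 0, $_);
}

print "assignment leaves: '", with_assignment('42*'), "'\n";
my ($matched, $text) = with_binding('42*');
print "binding leaves: '$text' (matched: $matched)\n";
```

With "=", a matching line ends up printed as "1" rather than as the line itself, and a non-matching line becomes an empty string, which is why the original loop would not print the data even without the UTF-8 error.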
Re: Regex (find an * after a digit)
by graff (Chancellor) on Feb 06, 2004 at 04:00 UTC
    To expand a bit on NetWallah's remarks, if you are using 5.8.0, and have a locale-aware environment that happens to be set currently to "utf-8", then perl will try to interpret non-ASCII data as if it were utf-8 data by default, and you'd need to expressly tell it not to do that, via the "use bytes" and/or "no utf8" pragmas (or by changing your locale setting). See the "perluniintro" and "perlunicode" man pages if this applies to you.
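To see whether this applies on a given machine, something like the following rough check may help. It assumes the usual POSIX precedence of LC_ALL over LC_CTYPE over LANG; the helper names ctype_locale and locale_is_utf8 are hypothetical.

```perl
use strict;
use warnings;

# Return the effective character-type locale from an environment hash,
# using the usual POSIX precedence: LC_ALL, then LC_CTYPE, then LANG.
sub ctype_locale {
    my ($env) = @_;
    for my $name (qw(LC_ALL LC_CTYPE LANG)) {
        my $value = $env->{$name};
        return $value if defined $value && length $value;
    }
    return 'C';    # nothing set: the default "C" locale applies
}

# True when the locale name mentions UTF-8 (e.g. "en_US.UTF-8", "ja_JP.utf8").
sub locale_is_utf8 {
    my ($env) = @_;
    return ctype_locale($env) =~ /utf-?8/i ? 1 : 0;
}

# Report the running perl version and the locale it was started under.
printf "perl %vd, locale %s, UTF-8 locale: %s\n",
    $^V, ctype_locale(\%ENV), locale_is_utf8(\%ENV) ? 'yes' : 'no';
```

Running it on both machines should show whether the one with the error is the one with a UTF-8 locale.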

    You say you are not really doing any utf8 stuff, but it would appear that your text file contains non-ASCII data (e.g. latin1 "accented" characters). If the file uses the "upper table" of iso-8859-1, then the "0xb2" byte is a superscript "2". If it's some other character set, then it's probably some other "special" character. Or maybe it's just noisy data...
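As a small illustration of why that byte trips the UTF-8 check, the sketch below looks at 0xb2 both ways. It assumes the core Encode module (shipped with 5.8) is available; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $byte = "\xb2";

# Decoded as ISO-8859-1, byte 0xB2 maps to code point U+00B2 (superscript two).
my $latin1_char = decode('iso-8859-1', $byte);
printf "latin1: U+%04X\n", ord $latin1_char;

# As UTF-8, 0xB2 has the bit pattern 10xxxxxx: a continuation byte,
# which is invalid without a preceding start byte.
my $is_continuation = (ord($byte) & 0xC0) == 0x80 ? 1 : 0;

# utf8::decode returns false when the bytes are not well-formed UTF-8.
my $copy = $byte;
my $valid_utf8 = utf8::decode($copy) ? 1 : 0;

print "continuation byte: $is_continuation, valid UTF-8: $valid_utf8\n";
```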

    In any case, upgrading to 5.8.2 (the current version) will help. 5.8.0's "interpretive preference" based on the locale setting was ultimately viewed as a bad idea, and was changed in 5.8.1.

      You know what. I never thought about the input. I'm using a file that's created by someone other than me. That person happens to be from another country (that uses the accented characters and eats baguettes)... I'll bet someone a coke that that's it.

      For the record, the machine that it DOES work on is my machine -- it has canna and kinput2 running at all times, and I go in and out of Japanese. The machine it doesn't work on is English only.

      Either way, the text file is an absolute mess, which is why I'm working on it in the first place.

      Cheers to everyone for their input!
Re: Regex (find an * after a digit)
by welchavw (Pilgrim) on Feb 06, 2004 at 02:23 UTC

    AM,

    I believe your regex is ok (your use of $_ may be somewhat mistaken, but that may be just due to a "quickie" example type-up).

    On 5.6.1, under Win32, I got the expected result from

    echo 5* | perl -e "while (<>) {print $_ if /\d{1,}\*/}"

    No answers for you beyond this...(perhaps I missed something???)

    ,welchavw

Re: Regex (find an * after a digit)
by The Mad Hatter (Priest) on Feb 06, 2004 at 02:56 UTC

    Though I know little in the way of Unicode, it seems to me that somewhere Perl thinks something in your regex is trying to specify a UTF-8 character when you aren't; are you using UTF-8 other places in the script? What version of perl are you using?

    Just as a side note, {1,} can be written more succinctly as +, like this: /\d+\*/
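A quick way to convince yourself the two quantifiers behave the same here. The sample strings are made up for the sketch, and matches_agree is a hypothetical helper name.

```perl
use strict;
use warnings;

# Return the match result (1 or 0) for each spelling of the pattern.
sub match_results {
    my ($s) = @_;
    my $braces = $s =~ /\d{1,}\*/ ? 1 : 0;
    my $plus   = $s =~ /\d+\*/    ? 1 : 0;
    return ($braces, $plus);
}

# True when both spellings agree on the given string.
sub matches_agree {
    my ($braces, $plus) = match_results(shift);
    return $braces == $plus ? 1 : 0;
}

for my $s ('5*', '1234*', '*', 'x*', '12 *') {
    my ($braces, $plus) = match_results($s);
    print "'$s' braces=$braces plus=$plus\n";
}
```

The braces form is mainly useful once an upper bound is wanted, for example \d{1,5}.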

      It thinks so, but I'm not using Unicode anywhere. Here's the whole script.
      #!/usr/bin/perl
      #
      #
      use strict;
      use warnings;
      my $file = "input.txt";
      open ( FILE, $file ) || die "Can't open $file $!";
      while ( <FILE> ) {
          chomp;
          if ( $_ = /\d+\*/ ) {
              print "$_\n";
          }
      }
      As for the \d{1,} -- I was actually going to put an upper limit on it at one point, so just wrote it that way. For now the + will do though.

      Update - I tried on another machine, and it works, so it's definitely machine related. Both machines are running Perl v5.8.0.

      Cheers
        Do you get the same error if you use =~ instead of = ?
        From perl581delta:
        =head2 UTF-8 On Filehandles No Longer Activated By Locale

        In Perl 5.8.0 all filehandles, including the standard filehandles, were implicitly set to be in Unicode UTF-8 if the locale settings indicated the use of UTF-8. This feature caused too many problems, so the feature was turned off and redesigned: see L</"Core Enhancements">.
        Consider upgrading, or switch to a non-UTF8 locale, or binmode FILE.
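A minimal sketch of the binmode route, also using =~ with an explicit variable. The sub name lines_with_starred_number is invented for the example, and input.txt is just the file name from the script earlier in the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the lines of $path that contain a run of digits followed by '*'.
# The handle is put in raw mode so bytes such as 0xB2 are never
# interpreted as UTF-8, whatever the locale says.
sub lines_with_starred_number {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Can't open $path: $!";
    binmode($fh);    # treat the input as plain bytes
    my @hits;
    while (my $line = <$fh>) {
        chomp $line;
        push @hits, $line if $line =~ /\d+\*/;
    }
    close($fh);
    return @hits;
}

# Usage:
# print "$_\n" for lines_with_starred_number('input.txt');
```

Reading raw bytes is enough here because the pattern only involves ASCII digits and an asterisk. If the accented characters themselves mattered, decoding the file from its real encoding would be the better route.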