Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm using UTF8 mode and '\W' to evaluate the UTF8 correctness of documents. Some documents have been corrupted by previous processing. I'm stepping through the UTF8 document using a match such as:
use utf8; $read_buf =~ m/^\W/; $utf8_char = $&;
When it gets to an invalid non-UTF8 character, the message "Malformed UTF-8 character (byte 0xff) in pattern match (m//) ..." is printed, but the match does not fail, and $& holds the bad character plus some subsequent bytes. How can I get the match to fail so that I can remove the first char in the buffer?

Re: UTF8 matches
by chromatic (Archbishop) on Aug 22, 2001 at 23:54 UTC
    Check if the match fails. $& and the other magic regex variables aren't cleared on failure:
        my $foo = "bar";
        $foo =~ /ar/;
        print "Got $&!\n";
        $foo =~ /az/;
        print "Got $&!\n";
    Throw an if block or an and in there and you'll be set:
        if ($foo =~ /ar/) {
            print "Got $&!\n";
        }
        $foo =~ /az/ and print "Got $&!\n";
      Thanks for your input. I guess I should have given more description. I tried
          if ( $str_read_buf =~ m/^\X/ ) {
              # do success stuff
          } else {
              # fail stuff
              # processing never gets here
          }
      but an error was never detected. The m/\X/ notices the invalid byte, because it displays the malformed character message, but the match does not fail. I should have mentioned that this is version 5.6.1.
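One possible way around the behaviour described above, offered as a sketch rather than a confirmed fix: treat the buffer as raw bytes and match its first character against an explicit pattern for well-formed UTF-8, so that an invalid byte makes the match fail instead of only producing a warning. The helper name leading_utf8_char and the sample buffer are hypothetical, and the pattern is the commonly used byte-range expression for well-formed UTF-8, not something taken from this thread. It assumes the buffer holds plain bytes that have not been marked as character data, and it has not been checked against 5.6.1 specifically.

```perl
use strict;
use warnings;

# Byte-level pattern for one well-formed UTF-8 sequence.
# Rejects stray continuation bytes, overlong forms, surrogates,
# and anything above U+10FFFF.
my $utf8_char = qr/
      [\x00-\x7F]                          # ASCII
    | [\xC2-\xDF][\x80-\xBF]               # 2-byte sequence
    | \xE0[\xA0-\xBF][\x80-\xBF]           # 3-byte, no overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # 3-byte, general case
    | \xED[\x80-\x9F][\x80-\xBF]           # 3-byte, no surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}        # 4-byte, no overlongs
    | [\xF1-\xF3][\x80-\xBF]{3}            # 4-byte, general case
    | \xF4[\x80-\x8F][\x80-\xBF]{2}        # 4-byte, up to U+10FFFF
/x;

# Hypothetical helper: returns the bytes of the leading well-formed
# UTF-8 character, or undef if the buffer starts with an invalid byte.
sub leading_utf8_char {
    my ($bytes) = @_;
    return $bytes =~ /\A($utf8_char)/ ? $1 : undef;
}

# Hypothetical sample buffer: 'A', one bad byte, then a valid 2-byte sequence.
my $buf = "A\xFF\xC3\xA9";
my @good;
my $bad = 0;

while (length($buf)) {
    my $char = leading_utf8_char($buf);
    if (defined $char) {
        push @good, $char;
        substr($buf, 0, length($char)) = '';   # consume the valid character
    } else {
        $bad++;
        substr($buf, 0, 1) = '';               # drop the first (bad) byte
    }
}
```

Because the check is an ordinary match with a true or false result, the else branch is reached whenever the buffer starts with a byte that cannot begin a valid sequence, which is the failure the question was asking for.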
Re: UTF8 matches
by Anonymous Monk on Aug 22, 2001 at 16:31 UTC
    Correction to question. I'm really using '\X' not '\W' for the match pattern. Typos Happen.
    use utf8; $read_buf =~ m/^\X/; $utf_char = $&;