UTF8 matches

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm using UTF8 mode and '\W' to evaluate the UTF8 correctness of documents. Some documents have been corrupted by previous processing. I'm stepping through the UTF8 document using a match such as

use utf8;
$read_buf =~ m/^\W/;
$utf8_char = $&;

When it gets to an invalid (non-UTF8) byte, Perl prints the message "Malformed UTF-8 character (byte0xff) in pattern match (m//) ...", but the match does not fail, and $& holds the bad byte plus some subsequent bytes. How can I get the match to fail so that I can remove the first character in the buffer?


Replies are listed 'Best First'.
Re: UTF8 matches by chromatic (Archbishop) on Aug 22, 2001 at 23:54 UTC
Check whether the match fails. $& and the other magic regex variables aren't cleared on failure:

my $foo = "bar";
$foo =~ /ar/;
print "Got $&!\n";
$foo =~ /az/;
print "Got $&!\n";

Throw an if block or an and in there and you'll be set:

if ($foo =~ /ar/) {
    print "Got $&!\n";
}
$foo =~ /az/ and print "Got $&!\n";
Re: Re: UTF8 matches by Anonymous Monk on Aug 23, 2001 at 01:48 UTC
Thanks for your input. I guess I should have given more description. I tried

if ( $str_read_buf =~ m/^\X/ ) {
    # do success stuff
}
else {
    # fail stuff
    # processing never gets here
}

but an error was never detected. The m/\X/ match detects the invalid byte, because it displays the malformed character message, but it does not fail. I should have mentioned that this is version 5.6.1.
Re: UTF8 matches by Anonymous Monk on Aug 22, 2001 at 16:31 UTC
Correction to the question: I'm really using '\X', not '\W', for the match pattern. Typos happen.

use utf8;
$read_buf =~ m/^\X/;
$utf_char = $&;
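For the goal in the original question (stepping through a buffer and dropping bytes that are not valid UTF-8), one option that does not depend on \X failing is to treat the buffer as raw bytes and match well-formed sequences with a byte-level pattern. The sketch below is only an illustration under stated assumptions: the data was read undecoded (for example from a filehandle in binmode) and is not processed under character semantics. The names `scan_utf8` and `$utf8_seq` are hypothetical and do not come from the thread. The byte ranges follow the Unicode standard's table of well-formed UTF-8 byte sequences, so overlong forms, surrogates, and values above U+10FFFF are rejected.

```perl
use strict;
use warnings;

# One well-formed UTF-8 sequence, expressed purely in terms of bytes.
# Assumption: the string being matched holds raw, undecoded bytes.
my $utf8_seq = qr/
      [\x00-\x7F]                          # ASCII
    | [\xC2-\xDF][\x80-\xBF]               # 2-byte sequence, no overlongs
    | \xE0[\xA0-\xBF][\x80-\xBF]           # 3-byte, excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # 3-byte, general case
    | \xED[\x80-\x9F][\x80-\xBF]           # 3-byte, excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}        # 4-byte, excluding overlongs
    | [\xF1-\xF3][\x80-\xBF]{3}            # 4-byte, general case
    | \xF4[\x80-\x8F][\x80-\xBF]{2}        # 4-byte, up to U+10FFFF
/x;

# Hypothetical helper: walk the buffer from the front. Each step either
# consumes one valid sequence or removes exactly one offending byte.
sub scan_utf8 {
    my ($buf) = @_;
    my (@good, @bad);
    while (length $buf) {
        if ($buf =~ s/\A($utf8_seq)//) {
            push @good, $1;                    # valid sequence, as bytes
        }
        else {
            push @bad, substr($buf, 0, 1, ''); # drop the first byte
        }
    }
    return (\@good, \@bad);
}

# "A", e-acute encoded as two bytes, a stray 0xFF byte, then "B".
my ($good, $bad) = scan_utf8("A\xC3\xA9\xFFB");
printf "%d valid sequences, %d bad bytes\n", scalar @$good, scalar @$bad;
```

With the sample buffer, the scan yields three valid sequences and one rejected byte (0xFF). Because the success test is the return value of the substitution rather than $&, a stale value from an earlier match cannot be mistaken for a fresh one. On Perl 5.8 and later, strict decoding with the core Encode module is another way to detect malformed input, but Encode was not part of the core distribution in 5.6.1.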