
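Ah. Well, that paragraph of mine that you quoted was not exactly pertinent -- and not correct either. The utf8 flag is very relevant in regex matches, in the following sense: in practice, a regex containing non-ASCII characters won't match a string unless both the regex and the string have the same value for the utf8 flag (both on or both off). When the flags differ, perl treats the flag-off side as Latin-1 characters, one per byte, so UTF-8 encoded bytes no longer line up with the decoded characters on the other side. Note this item from "perldoc Encode":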
$octets = encode("iso-8859-1", $string);

CAVEAT: When you run "$octets = encode("utf8", $string)", then $octets may not be equal to $string. Though they both contain the same data, the utf8 flag for $octets is always off. When you encode anything, utf8 flag of the result is always off, even when it contains completely valid utf8 string.
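A minimal sketch of that caveat, in case it helps to see the flag directly. The variable names are mine, the Cyrillic is written with code-point escapes so the result doesn't depend on how the file is saved, and I use the strict 'UTF-8' encoding name where the quoted docs use 'utf8':

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Five Cyrillic characters, written as code-point escapes so the example
# does not depend on the encoding of this source file.
my $string = "\x{43c}\x{43d}\x{43e}\x{433}\x{43e}";

# Encoding yields a byte string: ten octets, utf8 flag off.
my $octets = encode('UTF-8', $string);

# Decoding those octets yields a character string again, utf8 flag on.
my $round_trip = decode('UTF-8', $octets);

for my $pair ( [ string => $string ], [ octets => $octets ], [ round_trip => $round_trip ] ) {
    my ( $name, $value ) = @$pair;
    printf "%-10s length %2d, utf8 flag %s\n",
        $name, length($value), utf8::is_utf8($value) ? 'on' : 'off';
}

# Same data underneath, but the two scalars do not compare equal.
print "octets eq string     : ", ( $octets eq $string     ? 'yes' : 'no' ), "\n";
print "round_trip eq string : ", ( $round_trip eq $string ? 'yes' : 'no' ), "\n";
```

utf8::is_utf8 is only used here to peek at the flag for illustration; ordinary code shouldn't normally need to test it.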

I was not able to replicate your results, exactly, though I did observe something similar. Here's how it works out, and if this doesn't make sense, I'm at a loss how to explain it better -- it makes sense to me:

#!/usr/bin/perl
use Encode;

$regex_raw = 'много';
$text_raw = 'там очень много в городе, вот этих';

$regex_utf8_d = decode( 'iso-8859-5', $regex_raw );
$text_utf8_d = decode( 'iso-8859-5', $text_raw );
# the "_d" scalars have the utf8 flag ON
# perl will treat their values with character semantics

$regex_utf8_e = encode( 'utf8', $regex_utf8_d );
$text_utf8_e = encode( 'utf8', $text_utf8_d );
# the "_e" scalars have the utf8 flag OFF
# this use of encode is unnecessary and counter-productive;
# it causes perl to treat the values with byte semantics

@labels = qw/raw-raw dec-dec enc-enc dec-enc enc-dec/;

$match{'raw-raw'} = ($text_raw =~ /$regex_raw/);
$match{'dec-dec'} = ($text_utf8_d =~ /$regex_utf8_d/);
$match{'enc-enc'} = ($text_utf8_e =~ /$regex_utf8_e/);
$match{'dec-enc'} = ($text_utf8_d =~ /$regex_utf8_e/);
$match{'enc-dec'} = ($text_utf8_e =~ /$regex_utf8_d/);

for ( @labels ) {
    print "$_ : $match{$_}\n";
}

__OUTPUT__
raw-raw : 1
dec-dec : 1
enc-enc : 1
dec-enc : 
enc-dec : 

(I happened to use some randomly-chosen Cyrillic for this example. Naturally, with a multi-byte non-Unicode character set like ShiftJIS, you wouldn't want to use the "raw-raw" style of matching, because matching on bytes instead of characters could produce false matches that straddle character boundaries.)
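To illustrate the ShiftJIS point, here is a rough sketch with bytes I picked by hand (they are not from the original question). The text is two characters, 0x829F and 0x8341; the pattern is the two bytes 0x9F 0x83, which fall in the lead-byte and trail-byte ranges of a single double-byte character but here straddle the boundary between the two text characters:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hand-picked Shift_JIS bytes (an assumption for illustration only):
# 0x82 0x9F is a small hiragana "a", 0x83 0x41 is katakana "a".
my $text_bytes    = "\x82\x9F\x83\x41";

# The trail byte of the first character plus the lead byte of the second.
my $pattern_bytes = "\x9F\x83";

# Byte semantics: the pattern is found at byte offset 1, which is in the
# middle of the first character.
my $byte_match = ( $text_bytes =~ /\Q$pattern_bytes\E/ ) ? 1 : 0;

# Character semantics: decode both sides first, then match.
my $text_chars    = decode( 'shiftjis', $text_bytes );
my $pattern_chars = decode( 'shiftjis', $pattern_bytes );
my $char_match    = ( $text_chars =~ /\Q$pattern_chars\E/ ) ? 1 : 0;

print "text is ", length($text_chars), " characters long\n";
print "byte-level match      : $byte_match\n";
print "character-level match : $char_match\n";
```

The byte-level hit is the false alarm; once both sides are decoded, the pattern is a different character from either one in the text and the match goes away.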

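Just stop using the "encode( 'utf8', $decoded_string )" step -- it does you no good, and is the wrong thing to do in your case. Decode both the text and the regex once, match on the decoded strings, and encode only if and when you need to write the result out as bytes.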
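Roughly, the workflow I have in mind looks like the sketch below. The input bytes are hypothetical ISO-8859-5 data written as hex escapes (so the file's own encoding doesn't matter), and the variable names are mine:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical raw input, as ISO-8859-5 bytes: "там много" and "много".
my $raw_text  = "\xE2\xD0\xDC \xDC\xDD\xDE\xD3\xDE";
my $raw_regex = "\xDC\xDD\xDE\xD3\xDE";

# Step 1: decode everything once, at the point where it enters the program.
my $text  = decode( 'iso-8859-5', $raw_text );
my $regex = decode( 'iso-8859-5', $raw_regex );

# Step 2: match decoded against decoded (character semantics on both sides).
my $good = ( $text =~ /$regex/ ) ? 1 : 0;

# The step to avoid: re-encoding one side before matching.
my $text_encoded = encode( 'UTF-8', $text );
my $bad = ( $text_encoded =~ /$regex/ ) ? 1 : 0;

# Step 3: encode only on output; here an I/O layer does it for us.
binmode( STDOUT, ':encoding(UTF-8)' );
print "decoded text vs decoded regex : $good\n";
print "encoded text vs decoded regex : $bad\n";
print "matched in: $text\n" if $good;
```

Letting the output layer do the encoding means no scalar inside the program ever holds UTF-8 bytes that could get mixed up with decoded strings.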