Re^5: Matching UTF8 Regexps

Ah. Well, that paragraph of mine that you quoted was not exactly pertinent -- and not correct either. The utf8 flag is very relevant in regex matches, in the following sense: for non-ASCII data, a regex won't match a string unless both hold the text in the same form, either both decoded to characters (utf8 flag on) or both left as encoded octets (utf8 flag off). Mix the two and perl ends up comparing characters against bytes. Note this item from "perldoc Encode":

$octets = encode("iso-8859-1", $string);
CAVEAT: When you run "$octets = encode("utf8", $string)", then $octets may not be equal to $string. Though they both contain the same data, the utf8 flag for $octets is always off. When you encode anything, utf8 flag of the result is always off, even when it contains completely valid utf8 string.
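
To see that caveat in action, here is a minimal sketch of my own (not taken from the Encode docs). It builds the word from code points so the result doesn't depend on how the file is saved, and uses utf8::is_utf8() only to inspect the flag; the variable names are made up for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode;

# "много" as code points: U+043C U+043D U+043E U+0433 U+043E
my $chars  = "\x{43c}\x{43d}\x{43e}\x{433}\x{43e}";

# encode() returns octets; the utf8 flag of the result is off
my $octets = encode( 'utf8', $chars );

printf "chars : length %d, utf8 flag %s\n",
    length($chars), ( utf8::is_utf8($chars) ? 'on' : 'off' );
printf "octets: length %d, utf8 flag %s\n",
    length($octets), ( utf8::is_utf8($octets) ? 'on' : 'off' );

# same data, different form: perl does not consider the two strings equal
print "chars eq octets: ", ( $chars eq $octets ? 'yes' : 'no' ), "\n";
```

The character string has length 5 and the octet string has length 10, which is the whole difference between character semantics and byte semantics in a nutshell.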

I was not able to replicate your results, exactly, though I did observe something similar. Here's how it works out, and if this doesn't make sense, I'm at a loss how to explain it better -- it makes sense to me:

#!/usr/bin/perl

use Encode;

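# these literals are raw bytes; the decode() calls below assume the
# script file itself is saved as ISO-8859-5
$regex_raw = 'много';
$text_raw = 'там очень много в городе, вот этих';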

$regex_utf8_d = decode( 'iso-8859-5', $regex_raw );
$text_utf8_d = decode( 'iso-8859-5', $text_raw );

# the "_d" scalars have the utf8 flag ON
# perl will treat their values with character semantics


$regex_utf8_e = encode( 'utf8', $regex_utf8_d );
$text_utf8_e = encode( 'utf8', $text_utf8_d );

# the "_e" scalars have the utf8 flag OFF
# this use of encode is unnecessary and counter-productive; 
# it causes perl to treat the values with byte semantics


@labels = qw/raw-raw dec-dec enc-enc dec-enc enc-dec/;

$match{'raw-raw'} = ($text_raw =~ /$regex_raw/);
$match{'dec-dec'} = ($text_utf8_d =~ /$regex_utf8_d/);
$match{'enc-enc'} = ($text_utf8_e =~ /$regex_utf8_e/);

$match{'dec-enc'} = ($text_utf8_d =~ /$regex_utf8_e/);
$match{'enc-dec'} = ($text_utf8_e =~ /$regex_utf8_d/);

for ( @labels ) {
    print "$_ : $match{$_}\n";
}

__OUTPUT__
raw-raw : 1
dec-dec : 1
enc-enc : 1
dec-enc : 
enc-dec :

(I happened to use some randomly-chosen Cyrillic for this example. Naturally, with a multi-byte non-unicode character set like ShiftJIS, you wouldn't want to use the "raw-raw" style of matching, because using bytes instead of characters could lead to false-alarm matches that don't obey character boundaries.)
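
Here is a rough sketch of that false-alarm problem, assuming your Encode installation includes the 'shiftjis' encoding (it ships with Encode::JP). The two katakana characters and the variable names are arbitrary choices of mine. The pattern is simply the two octets that straddle the boundary between the characters, so it can only "match" by ignoring where characters begin and end.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode;

# Two katakana characters (U+30E1 U+30A2), given as code points so the
# example does not depend on the encoding of this file.
my $text_chars = "\x{30e1}\x{30a2}";
my $text_sjis  = encode( 'shiftjis', $text_chars );   # two octets per character

# Take the octets that straddle the character boundary: the trail byte
# of the first character plus the lead byte of the second.
my $straddle_sjis = substr( $text_sjis, 1, 2 );

# byte semantics: the octet pair is found, although no character in the
# text starts at that position
my $byte_match = ( $text_sjis =~ /\Q$straddle_sjis\E/ ) ? 1 : 0;

# character semantics: decode the pattern, then match against the
# decoded text; whatever the octet pair decodes to, it is not one of
# the two characters in the text
my $straddle_chars = decode( 'shiftjis', $straddle_sjis );
my $char_match = ( $text_chars =~ /\Q$straddle_chars\E/ ) ? 1 : 0;

print "byte match: $byte_match\n";
print "char match: $char_match\n";
```

The byte-level match succeeds and the character-level match fails, which is why decoding first is the safer habit with multi-byte encodings.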

Just stop using the "encode( 'utf8', $decoded_string )" step -- it does you no good, and is the wrong thing to do in your case.
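
In other words, something along these lines: decode once on the way in, match on the decoded strings, and leave any encoding to the output stage. This is only a sketch under my own assumptions. The "raw" values are stand-ins built from code points (so the example doesn't depend on the file's encoding), and I've assumed you want UTF-8 on STDOUT.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode;

# Stand-ins for octets read from an ISO-8859-5 source:
# "много" and "очень много"
my $regex_raw = encode( 'iso-8859-5',
    "\x{43c}\x{43d}\x{43e}\x{433}\x{43e}" );
my $text_raw = encode( 'iso-8859-5',
    "\x{43e}\x{447}\x{435}\x{43d}\x{44c} \x{43c}\x{43d}\x{43e}\x{433}\x{43e}" );

# 1. decode once, on the way in
my $regex = decode( 'iso-8859-5', $regex_raw );
my $text  = decode( 'iso-8859-5', $text_raw );

# 2. match on the decoded strings; no encode() step in between
my $found = ( $text =~ /\Q$regex\E/ ) ? 1 : 0;

# 3. encode only on the way out, here by putting a layer on STDOUT
binmode( STDOUT, ':encoding(UTF-8)' );
print "found: $found\n";
print "text: $text\n";
```

Using an output layer instead of calling encode() by hand keeps every string inside the program in character form, so there is no chance of mixing the two kinds by accident.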
