Re^4: Matching UTF8 Regexps

Well, maybe the status of the utf8 flag on the regex could be irrelevant, but if the regex has characters in one encoding, and the string you apply it to is in some other encoding, there's no way it can work as intended.

I don't get it. I thought I said I normalize both to UTF8..? Okay, so I probably wasn't clear. If the following doesn't make clear why I'm confused, then I will just have to admit that I don't know what I'm doing at all...

  # assume all encode/decode works correctly
  my $regexp_raw = "....";
  my $regexp_utf8_decoded = decode($some_enc, $regexp_raw);
  my $regexp_utf8_encoded = encode('utf8', $regexp_utf8_decoded);


  my $some_content = "...";
  my $some_content_decoded = decode($some_enc, $some_content);
  my $some_content_encoded = encode('utf8', $some_content_decoded);

  $some_content_decoded =~ /$regexp_utf8_decoded/; # matches correctly
  $some_content_decoded =~ /$regexp_utf8_encoded/; # matches correctly
  $some_content_encoded =~ /$regexp_utf8_decoded/; # no
  $some_content_encoded =~ /$regexp_utf8_encoded/; # no

So why doesn't $some_content_encoded match? I thought all strings are normalized to UTF8...


Re^5: Matching UTF8 Regexps
by graff (Chancellor) on Mar 08, 2005 at 14:31 UTC

This caveat from the Encode documentation is very relevant here:

$octets = encode("iso-8859-1", $string);
CAVEAT: When you run "$octets = encode("utf8", $string)", then $octets may not be equal to $string. Though they both contain the same data, the utf8 flag for $octets is always off. When you encode anything, utf8 flag of the result is always off, even when it contains completely valid utf8 string.
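
To make that caveat concrete, here is a minimal sketch that inspects both strings. It assumes the input is the ISO-8859-5 byte sequence for "много", written as hex escapes so the encoding of the script file itself does not matter. The variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: these five bytes are "много" in ISO-8859-5.
my $raw = "\xDC\xDD\xDE\xD3\xDE";

my $decoded = decode('iso-8859-5', $raw);   # character string, utf8 flag on
my $encoded = encode('utf8', $decoded);     # UTF-8 octets, utf8 flag off

# utf8::is_utf8 reports the internal flag; length counts characters
# for $decoded but bytes for $encoded.
printf "decoded: flag=%d length=%d\n",
    (utf8::is_utf8($decoded) ? 1 : 0), length $decoded;
printf "encoded: flag=%d length=%d\n",
    (utf8::is_utf8($encoded) ? 1 : 0), length $encoded;
```

The decoded string should report five characters with the flag on, and the encoded one ten bytes with the flag off, which is why the two cannot be matched against each other.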

I was not able to replicate your results, exactly, though I did observe something similar. Here's how it works out, and if this doesn't make sense, I'm at a loss how to explain it better -- it makes sense to me:

#!/usr/bin/perl

use Encode;

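# NB: the string literals below are only valid iso-8859-5 input
# if this script file is itself saved in iso-8859-5
$regex_raw = 'много';
$text_raw = 'там очень много в городе, вот этих';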

$regex_utf8_d = decode( 'iso-8859-5', $regex_raw );
$text_utf8_d = decode( 'iso-8859-5', $text_raw );

# the "_d" scalars have the utf8 flag ON
# perl will treat their values with character semantics


$regex_utf8_e = encode( 'utf8', $regex_utf8_d );
$text_utf8_e = encode( 'utf8', $text_utf8_d );

# the "_e" scalars have the utf8 flag OFF
# this use of encode is unnecessary and counter-productive; 
# it causes perl to treat the values with byte semantics


@labels = qw/raw-raw dec-dec enc-enc dec-enc enc-dec/;

$match{'raw-raw'} = ($text_raw =~ /$regex_raw/);
$match{'dec-dec'} = ($text_utf8_d =~ /$regex_utf8_d/);
$match{'enc-enc'} = ($text_utf8_e =~ /$regex_utf8_e/);

$match{'dec-enc'} = ($text_utf8_d =~ /$regex_utf8_e/);
$match{'enc-dec'} = ($text_utf8_e =~ /$regex_utf8_d/);

for ( @labels ) {
    print "$_ : $match{$_}\n";
}

__OUTPUT__
raw-raw : 1
dec-dec : 1
enc-enc : 1
dec-enc : 
enc-dec :

Just stop using the "encode( 'utf8', $decoded_string )" step -- it does you no good, and is the wrong thing to do in your case.
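
In practice that advice amounts to: decode on the way in, match on the decoded strings, and encode only when writing output. A minimal sketch of that flow follows, assuming ISO-8859-5 input given as hex escapes (the bytes for "очень много" and "много"). The variable names are hypothetical, and `\Q...\E` is added here only so the pattern is treated as literal text.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $enc = 'iso-8859-5';   # assumed encoding of the incoming bytes

# Assumed raw input: ISO-8859-5 bytes for the text and the pattern.
my $text_raw    = "\xDE\xE7\xD5\xDD\xEC \xDC\xDD\xDE\xD3\xDE";  # "очень много"
my $pattern_raw = "\xDC\xDD\xDE\xD3\xDE";                       # "много"

# Decode once at the input boundary; both sides are now character strings.
my $text    = decode($enc, $text_raw);
my $pattern = decode($enc, $pattern_raw);

# Match characters against characters.
my $found = ($text =~ /\Q$pattern\E/) ? 1 : 0;

# Encode only at the output boundary.
print encode('UTF-8', "match=$found in: $text\n");
```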
