in reply to Re: Matching UTF8 Regexps
in thread Matching UTF8 Regexps

Thanks, but the guessing method is not pertinent in this case, as I was testing it with a fixed set of charsets and the same content before posting. :(

I'm just stumped why encode(), which in this case sets the UTF8 flag on, makes any difference. I can understand if it was necessary to make sure that both the regexp and the content had the UTF8 flag on (or off), but in this case it doesn't matter if the UTF8 flag on the regexp is set or not. It matters that the content has the UTF8 flag off. This is the part that I don't get. I'm not failing to convert the content into the correct encoding.

<ASIDE>I only use Encode::Guess as a last resort because, more often than not, it tends... to utterly fail to work. It's understandable, though, because there are many who decide to gratuitously mix euc-jp and shift-jis in the same page, for example. Ugh.

Anyhow, guessing currently goes through about 20 heuristic steps. If all else fails, we try to decode using Encode::Guess just to see if we can do it, but we don't really rely on it.</ASIDE>

Re^3: Matching UTF8 Regexps
by graff (Chancellor) on Mar 08, 2005 at 04:31 UTC
    I'm sorry, but you don't seem to be making any sense.
    I'm just stumped why encode(), which in this case sets the UTF8 flag on, makes any difference.

    You have it backwards: the parameters given to encode() are a character set spec and a utf8 string; in the typical case, the character set spec is non-unicode, but in any case, the function returns a scalar that contains octets (raw bytes), not "characters" in the perl-internal/utf8 sense, and so the scalar value being returned has the utf8 flag off.

    It's the "decode" function that takes a character-encoding name and a scalar of octets, and returns a scalar string of utf8 characters, with the utf8 flag on.
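
    A minimal sketch of that distinction, assuming the core Encode module and using utf8::is_utf8 only to inspect the flag. The sample octets are made up for illustration and are not from the original post:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: the UTF-8 octets for U+00E9, as they might
# arrive from a file or a socket.
my $octets = "\xC3\xA9";

# decode(): octets in, characters out; the result has the utf8 flag on.
my $chars = decode('UTF-8', $octets);

# encode(): characters in, octets out; the result has the utf8 flag off.
my $bytes = encode('UTF-8', $chars);

printf "decoded: length %d, utf8 flag %s\n",
    length($chars), utf8::is_utf8($chars) ? 'on' : 'off';
printf "encoded: length %d, utf8 flag %s\n",
    length($bytes), utf8::is_utf8($bytes) ? 'on' : 'off';
```

    The decoded scalar holds one character, while the encoded scalar holds two octets, even though both represent the same text.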

    I can understand if it was necessary to make sure that both the regexp and the content had the UTF8 flag on (or off), but in this case it doesn't matter if the UTF8 flag on the regexp is set or not.

    Well, maybe the status of the utf8 flag on the regex could be irrelevant (update: no, it is important; see later reply), and of course if the regex has characters in one encoding, and the string you apply it to is in some other encoding, there's no way it can work as intended.

    It matters that the content has the UTF8 flag off. This is the part that I don't get. I'm not failing to convert the content into the correct encoding.

    The reason this makes no sense is probably due to your misunderstanding about the roles of encode() and decode() with respect to clearing/setting the utf8 flag. Assuming that you are, in fact, successfully converting (decoding) both the regex and the fetched web-page contents to utf8, then sure, things should work fine, and there should be no mystery about that. (updated to fix grammar)

      Well, maybe the status of the utf8 flag on the regex could be irrelevant, but if the regex has characters in one encoding, and the string you apply it to is in some other encoding, there's no way it can work as intended.

      I don't get it. I thought I said I normalize both to UTF8..? Okay, so I probably wasn't clear. If the following doesn't make clear why I'm confused, then I will just have to admit that I don't know what I'm doing at all...

          # assume all encode/decode works correctly
          my $regexp_raw          = "....";
          my $regexp_utf8_decoded = decode($some_enc, $regexp_raw);
          my $regexp_utf8_encoded = encode('utf8', $regexp_utf8_decoded);

          my $some_content         = "...";
          my $some_content_decoded = decode($some_enc, $some_content);
          my $some_content_encoded = encode('utf8', $some_content_decoded);

          $some_content_decoded =~ /$regexp_utf8_decoded/; # matches correctly
          $some_content_decoded =~ /$regexp_utf8_encoded/; # matches correctly
          $some_content_encoded =~ /$regexp_utf8_decoded/; # no
          $some_content_encoded =~ /$regexp_utf8_encoded/; # no

      So why doesn't $some_content_encoded match? I thought all strings are normalized to UTF8...

        Ah. Well, that paragraph of mine that you quoted was not exactly pertinent -- and not correct either. The utf8 flag is very relevant in regex matches, in the following sense: when non-ASCII text is involved, a regex won't reliably match a string unless both the regex and the string have the same value for the utf8 flag (both on or both off). Note this item from "perldoc Encode":
        $octets = encode("iso-8859-1", $string);

        CAVEAT: When you run "$octets = encode("utf8", $string)", then $octets may not be equal to $string. Though they both contain the same data, the utf8 flag for $octets is always off. When you encode anything, utf8 flag of the result is always off, even when it contains completely valid utf8 string.

        I was not able to replicate your results, exactly, though I did observe something similar. Here's how it works out, and if this doesn't make sense, I'm at a loss how to explain it better -- it makes sense to me:

            #!/usr/bin/perl
            use Encode;

            $regex_raw = 'много';
            $text_raw  = 'там очень много в городе, вот этих';

            $regex_utf8_d = decode( 'iso-8859-5', $regex_raw );
            $text_utf8_d  = decode( 'iso-8859-5', $text_raw );
            # the "_d" scalars have the utf8 flag ON
            # perl will treat their values with character semantics

            $regex_utf8_e = encode( 'utf8', $regex_utf8_d );
            $text_utf8_e  = encode( 'utf8', $text_utf8_d );
            # the "_e" scalars have the utf8 flag OFF
            # this use of encode is unnecessary and counter-productive;
            # it causes perl to treat the values with byte semantics

            @labels = qw/raw-raw dec-dec enc-enc dec-enc enc-dec/;

            $match{'raw-raw'} = ($text_raw    =~ /$regex_raw/);
            $match{'dec-dec'} = ($text_utf8_d =~ /$regex_utf8_d/);
            $match{'enc-enc'} = ($text_utf8_e =~ /$regex_utf8_e/);
            $match{'dec-enc'} = ($text_utf8_d =~ /$regex_utf8_e/);
            $match{'enc-dec'} = ($text_utf8_e =~ /$regex_utf8_d/);

            for ( @labels ) {
                print "$_ : $match{$_}\n";
            }

            __OUTPUT__
            raw-raw : 1
            dec-dec : 1
            enc-enc : 1
            dec-enc :
            enc-dec :

        (I happened to use some randomly-chosen Cyrillic for this example. Naturally, with a multi-byte non-unicode character set like ShiftJIS, you wouldn't want to use the "raw-raw" style of matching, because using bytes instead of characters could lead to false-alarm matches that don't obey character boundaries.)
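
        To make the false-alarm case concrete, here is a small sketch that assumes Encode's "shiftjis" encoding and a made-up one-character sample. In Shift_JIS the katakana U+30A2 is stored as two bytes, and the second byte is the same as the ASCII code for "A":

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: katakana U+30A2 encoded as Shift_JIS octets.
# Its trailing byte is 0x41, which is also ASCII "A".
my $raw_sjis = encode('shiftjis', "\x{30A2}");

# Byte semantics: the pattern matches the trailing byte of the
# two-byte character, which is a false alarm.
my $raw_hit = ($raw_sjis =~ /A/) ? 1 : 0;

# Character semantics: after decoding, the string is a single
# character that is not "A", so the pattern does not match.
my $decoded     = decode('shiftjis', $raw_sjis);
my $decoded_hit = ($decoded =~ /A/) ? 1 : 0;

print "raw bytes match /A/: $raw_hit\n";
print "decoded match /A/:   $decoded_hit\n";
```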

        Just stop using the "encode( 'utf8', $decoded_string )" step -- it does you no good, and is the wrong thing to do in your case.
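
        One way to arrange this, sketched under the assumption that both the pattern and the content arrive as octets in a known encoding (the variable names and the sample data are illustrative only): decode both once on input, do all matching on the decoded strings, and let an output layer handle encoding when printing.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical inputs: a pattern and some page content, both arriving
# as iso-8859-5 octets. encode() is used here only to fabricate them.
my $enc       = 'iso-8859-5';
my $regex_raw = encode($enc, "\x{43C}\x{43D}\x{43E}\x{433}\x{43E}");
my $text_raw  = encode($enc,
    "\x{43E}\x{447}\x{435}\x{43D}\x{44C} \x{43C}\x{43D}\x{43E}\x{433}\x{43E}");

# Decode once, on input; both scalars now hold characters.
my $regex = decode($enc, $regex_raw);
my $text  = decode($enc, $text_raw);

# Match decoded against decoded; no encode('utf8', ...) step is needed.
my ($hit) = $text =~ /($regex)/;

# Encode only at the output boundary, via an I/O layer.
binmode STDOUT, ':encoding(UTF-8)';
print defined $hit ? "matched: $hit\n" : "no match\n";
```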