PerlMonks  

Re^4: Matching UTF8 Regexps

by lestrrat (Deacon)
on Mar 08, 2005 at 07:34 UTC


in reply to Re^3: Matching UTF8 Regexps
in thread Matching UTF8 Regexps

Well, maybe the status of the utf8 flag on the regex could be irrelevant, but if the regex has characters in one encoding, and the string you apply it to is in some other encoding, there's no way it can work as intended.

I don't get it. I thought I said I normalize both to UTF8..? Okay, so I probably wasn't clear. If the following doesn't make clear why I'm confused, then I will just have to admit that I don't know what I'm doing at all...

    use Encode;  # provides decode() and encode()

    # assume all encode/decode works correctly
    my $regexp_raw = "....";
    my $regexp_utf8_decoded = decode($some_enc, $regexp_raw);
    my $regexp_utf8_encoded = encode('utf8', $regexp_utf8_decoded);

    my $some_content = "...";
    my $some_content_decoded = decode($some_enc, $some_content);
    my $some_content_encoded = encode('utf8', $some_content_decoded);

    $some_content_decoded =~ /$regexp_utf8_decoded/; # matches correctly
    $some_content_decoded =~ /$regexp_utf8_encoded/; # matches correctly
    $some_content_encoded =~ /$regexp_utf8_decoded/; # no
    $some_content_encoded =~ /$regexp_utf8_encoded/; # no

So why doesn't $some_content_encoded match? I thought all strings are normalized to UTF8...

Replies are listed 'Best First'.
Re^5: Matching UTF8 Regexps
by graff (Chancellor) on Mar 08, 2005 at 14:31 UTC
    Ah. Well, that paragraph of mine that you quoted was not exactly pertinent, and not correct either. The utf8 flag is very relevant in regex matches, in the following sense: a decoded string (flag on) holds characters, while an encoded string (flag off) holds the individual bytes of those characters, so a regex with non-ASCII content only matches as intended when the regex and the string are of the same kind (both decoded or both encoded). Note this item from "perldoc Encode":
    $octets = encode("iso-8859-1", $string);

    CAVEAT: When you run "$octets = encode("utf8", $string)", then $octets may not be equal to $string. Though they both contain the same data, the utf8 flag for $octets is always off. When you encode anything, utf8 flag of the result is always off, even when it contains completely valid utf8 string.
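A minimal sketch of what that caveat looks like in practice, assuming a Perl with the core Encode module. The variable names are made up for illustration, and the word is written with \x{...} escapes so the result does not depend on how the script file is saved.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The Cyrillic word from this thread, spelled as code points:
# U+043C U+043D U+043E U+0433 U+043E
my $chars  = "\x{43c}\x{43d}\x{43e}\x{433}\x{43e}";

# encode() returns a byte string; its utf8 flag is off.
my $octets = encode('UTF-8', $chars);

printf "chars : length %d, utf8 flag %s\n",
    length($chars),  utf8::is_utf8($chars)  ? 'on' : 'off';
printf "octets: length %d, utf8 flag %s\n",
    length($octets), utf8::is_utf8($octets) ? 'on' : 'off';

# Perl compares the octets one byte per character, so the strings differ.
print "equal? ", ($chars eq $octets ? 'yes' : 'no'), "\n";

# Decoding the octets gives back the original character string.
my $round_trip = decode('UTF-8', $octets);
print "round trip equal? ", ($round_trip eq $chars ? 'yes' : 'no'), "\n";
```

The decoded string is five characters long with the flag on, the encoded one is ten bytes long with the flag off, and a regex sees those ten bytes as ten separate characters.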

    I was not able to replicate your results, exactly, though I did observe something similar. Here's how it works out, and if this doesn't make sense, I'm at a loss how to explain it better -- it makes sense to me:

        #!/usr/bin/perl

        use Encode;

        $regex_raw = 'много';
        $text_raw = 'там очень много в городе, вот этих';

        $regex_utf8_d = decode( 'iso-8859-5', $regex_raw );
        $text_utf8_d = decode( 'iso-8859-5', $text_raw );

        # the "_d" scalars have the utf8 flag ON
        # perl will treat their values with character semantics

        $regex_utf8_e = encode( 'utf8', $regex_utf8_d );
        $text_utf8_e = encode( 'utf8', $text_utf8_d );

        # the "_e" scalars have the utf8 flag OFF
        # this use of encode is unnecessary and counter-productive;
        # it causes perl to treat the values with byte semantics

        @labels = qw/raw-raw dec-dec enc-enc dec-enc enc-dec/;

        $match{'raw-raw'} = ($text_raw =~ /$regex_raw/);
        $match{'dec-dec'} = ($text_utf8_d =~ /$regex_utf8_d/);
        $match{'enc-enc'} = ($text_utf8_e =~ /$regex_utf8_e/);
        $match{'dec-enc'} = ($text_utf8_d =~ /$regex_utf8_e/);
        $match{'enc-dec'} = ($text_utf8_e =~ /$regex_utf8_d/);

        for ( @labels ) {
            print "$_ : $match{$_}\n";
        }

        __OUTPUT__
        raw-raw : 1
        dec-dec : 1
        enc-enc : 1
        dec-enc :
        enc-dec :
    (I happened to use some randomly-chosen Cyrillic for this example. Naturally, with a multi-byte non-unicode character set like ShiftJIS, you wouldn't want to use the "raw-raw" style of matching, because using bytes instead of characters could lead to false-alarm matches that don't obey character boundaries.)
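To make the ShiftJIS remark concrete, here is a small sketch of one such false-alarm match. It assumes the 'shiftjis' encoding that ships with Encode (Encode::JP); the variable names are only illustrative. The katakana letter U+30A2 is stored in ShiftJIS as the bytes 0x83 0x41, and 0x41 is also ASCII "A".

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# One character: KATAKANA LETTER A (U+30A2).
my $text_chars = "\x{30A2}";

# Its raw ShiftJIS form is two bytes, the second of which is 0x41 ("A").
my $text_bytes = encode('shiftjis', $text_chars);

# Byte-level match: finds an "A" inside the two-byte character.
my $bytes_match = ($text_bytes =~ /A/) ? 1 : 0;

# Character-level match on the decoded string: no "A" to be found.
my $chars_match = (decode('shiftjis', $text_bytes) =~ /A/) ? 1 : 0;

print "byte match: $bytes_match\n";
print "char match: $chars_match\n";
```

The raw match succeeds on the trailing byte of a character that has nothing to do with "A", which is the sort of boundary violation that decoding first avoids.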

    Just stop using the "encode( 'utf8', $decoded_string )" step -- it does you no good, and is the wrong thing to do in your case.
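One way to read that advice, sketched under assumptions: decode once when the data comes in, match on the decoded strings, and encode only when writing output. The names below ($source_enc, $pattern_raw and so on) are hypothetical, and the raw input bytes are manufactured with encode() just to keep the example self-contained.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $source_enc = 'iso-8859-5';   # assumed encoding of the incoming data

# Stand-ins for bytes that would normally come from a file or socket.
my $pattern_raw = encode($source_enc, "\x{43c}\x{43d}\x{43e}\x{433}\x{43e}");
my $text_raw    = encode($source_enc,
    "\x{43e}\x{447}\x{435}\x{43d}\x{44c} \x{43c}\x{43d}\x{43e}\x{433}\x{43e}");

# 1. Decode once, at the input boundary.
my $pattern = decode($source_enc, $pattern_raw);
my $text    = decode($source_enc, $text_raw);

# 2. Match characters against characters; no encode('utf8', ...) step.
my ($found) = $text =~ /($pattern)/;
my $match_offset = defined $found ? $-[0] : -1;   # offset in characters

# 3. Encode only when producing output.
my $out_octets = defined $found ? encode('UTF-8', $found) : '';
print "offset: $match_offset\n";
print "$out_octets\n";
```

With this layout the regex and the string are always both decoded, so the mixed dec-enc and enc-dec cases never arise.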
