danmcb has asked for the wisdom of the Perl Monks concerning the following question:

If I have two strings which represent the same Unicode string, like:

stù

and

st\x{f9}

how can I test for string equality in a robust way? (To give some context, one is coming to the app as a CGI parameter, the other from a database. But essentially, either may be represented as UTF-8, or as escape sequences. A robust "eq" is needed.)

Update: I have tried:

 my $temp = sprintf( "%s", $escaped_string );

to try to lose the escaped chars, without success.

Replies are listed 'Best First'.
Re: string comparison with escape sequences
by almut (Canon) on Aug 30, 2007 at 18:35 UTC

    There are of course several ways to do it... but one way would be to convert the non-UTF-8 string to UTF-8, so that you can compare them. (Note that I'm assuming your non-UTF-8 string is literally 'st\x{f9}', not what it would be if you had it written like "st\x{f9}" in some Perl code...)

    This should do the job:

    my $x = 'st\x{f9}';
    $x =~ s/\\x\{([\da-fA-F]{2,4})\}/pack("U",hex($1))/ge;

    Then, you can do if ($x eq $u) ..., presuming that $u is your unicode string (with utf8 flag on!).
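    A minimal sketch of that comparison, assuming the Unicode side arrives as raw UTF-8 octets (for instance from a CGI parameter) and still has to be decoded into characters. The variable $raw and its contents are made up for illustration; only the substitution is taken from the post above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: the UTF-8 encoded octets of "stù" as they might arrive from CGI.
my $raw = "st\xc3\xb9";

# Decode the octets into a character string (3 characters, utf8 flag on).
my $u = decode('UTF-8', $raw);

# The literally escaped form, unescaped with the substitution shown above.
my $x = 'st\x{f9}';
$x =~ s/\\x\{([\da-fA-F]{2,4})\}/pack("U", hex($1))/ge;

# Report the lengths before and after decoding, and the result of the comparison.
print length($raw), " ", length($u), " ", ($x eq $u ? "equal" : "different"), "\n";
```

    Without the decode step, $raw is four octets long and does not compare equal to $x.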

    Update: another way would be:

    use Encode;
    my $x = 'st\x{f9}';
    my $iso = eval '"'.$x.'"';
    $x = Encode::decode('iso-8859-1', $iso);

    but I would only use eval on an arbitrary string if I could be absolutely sure it doesn't contain anything malicious...

    Note that the decode() is required if your \x{...} values are smaller than 0x100, in which case they will not be unicode after the eval.

    Update 2: actually, my latter statement is not 100% correct (i.e., the decode is not strictly required here... but it doesn't do any harm either). The reason is that in a number of cases (like the one here with char 0xf9), Perl automatically upgrades the ISO-8859-1 (Latin-1) string. In other words, when comparing a Latin-1 (byte) string with the corresponding real Unicode string, you would in fact get the (naively) expected result that they're equal...
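    The upgrade behaviour can be checked with a small sketch like the following. The variable names are illustrative only; utf8::upgrade is used here just to force the internal UTF-8 representation on a copy of the string.

```perl
use strict;
use warnings;

my $latin1   = "st\x{f9}";       # byte string: 3 characters, utf8 flag off
my $upgraded = $latin1;
utf8::upgrade($upgraded);        # same 3 characters, stored internally as UTF-8
my $utf8_octets = "st\xc3\xb9";  # undecoded UTF-8 octets: 4 characters as far as Perl knows

# Same characters, different internal representation: eq compares characters.
print "latin1 eq upgraded:    ", ($latin1 eq $upgraded    ? "yes" : "no"), "\n";

# Undecoded octets are a different string altogether.
print "latin1 eq utf8 octets: ", ($latin1 eq $utf8_octets ? "yes" : "no"), "\n";
```

    The first comparison succeeds and the second does not, which suggests that the thing to watch for is undecoded UTF-8 octets rather than the state of the utf8 flag.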

Re: string comparison with escape sequences
by kyle (Abbot) on Aug 30, 2007 at 18:44 UTC

    I think the problem is with identifying whether a given series of bytes is utf8 or not. It's not hard to take a string with escape sequences and turn it into the corresponding bytes. Consider:

    use Test::More 'no_plan';
    my $escaped = 'st\x{f9}';
    my $bytes = "st\x{f9}";
    ok( $escaped ne $bytes, '$escaped ne $bytes' );
    my $unescaped = $escaped;
    $unescaped =~ s{ \\x\{ ([a-f0-9]{2}) \} }{ chr hex $1 }xmseg;
    is( $unescaped, $bytes, '$unescaped eq $bytes' );

    My question is, what do you do with a string like "\\x{f9}\x{263a}"? It contains a utf8 character, and it also contains what looks like an escape sequence. It must be a utf8 string, so it should not be unescaped. Now imagine we chop off that last utf8 character. It's still a utf8 string, but all you see is an escape sequence.

    In short, if you don't know the character encoding of the bytes you receive, it's hard to interpret them.

    All that having been said, I doubt a user is going to feed you anything that matches /\\x\{[a-f0-9]{2}\}/ as a literal string, so it's pretty safe to assume that if that's in the string somewhere, you need to unescape it. Even so, I can't recommend roaming through a stream of bits, whose character encoding you don't know, changing pieces of it, and calling that some kind of progress.
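    One common heuristic for the unknown-encoding problem, sketched here as an assumption rather than a recommendation: try a strict UTF-8 decode first and fall back to ISO-8859-1 if it fails. The helper name decode_guess is hypothetical. Latin-1 input that happens to form valid UTF-8 will be misread, so this remains a guess.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: strict UTF-8 decode, falling back to ISO-8859-1.
sub decode_guess {
    my ($octets) = @_;

    # FB_CROAK makes decode() die on malformed UTF-8 instead of substituting;
    # LEAVE_SRC keeps decode() from modifying $octets in place.
    my $chars = eval {
        Encode::decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $chars if defined $chars;

    # Every octet sequence is valid ISO-8859-1, so this cannot fail.
    return Encode::decode('iso-8859-1', $octets);
}

# Both octet forms of "stù" end up as the same 3-character string.
my $from_utf8   = decode_guess("st\xc3\xb9");
my $from_latin1 = decode_guess("st\xf9");
print $from_utf8 eq $from_latin1 ? "equal\n" : "different\n";
```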

Re: string comparison with escape sequences
by dah (Sexton) on Aug 30, 2007 at 18:20 UTC
    Unescape the escape sequences, and then compare.
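    As a rough sketch of that advice, assuming both inputs are already character strings (decoded from whatever encoding they arrived in) and that the only escape form in play is a literal \x{...} with two to four hex digits. The helper names unescape_hex and escaped_eq are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turn literal \x{HH} or \x{HHHH} sequences into characters.
sub unescape_hex {
    my ($s) = @_;
    $s =~ s/\\x\{([0-9a-fA-F]{2,4})\}/chr hex $1/ge;
    return $s;
}

# Hypothetical helper: compare two strings after unescaping both sides.
sub escaped_eq {
    my ($left, $right) = @_;
    return unescape_hex($left) eq unescape_hex($right);
}

# The single-quoted argument holds a literal escape sequence,
# the double-quoted one holds the actual character.
print escaped_eq('st\x{f9}', "st\x{f9}") ? "equal\n" : "different\n";
```

    Unescaping both sides means the caller does not need to know which of the two strings carried the escape sequences.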