string comparison with escape sequences

danmcb has asked for the wisdom of the Perl Monks concerning the following question:

Re: string comparison with escape sequences by almut (Canon) on Aug 30, 2007 at 18:35 UTC
There are of course several ways to do it, but one is to convert the non-UTF-8 string to UTF-8 so that you can compare the two. (Note that I'm assuming your non-UTF-8 string is literally 'st\x{f9}', with a real backslash, x and braces in it, not what you would get from writing "st\x{f9}" in Perl code.) This should do the job: `my $x = 'st\x{f9}'; $x =~ s/\\x\{([\da-fA-F]{2,4})\}/pack("U",hex($1))/ge;` Then you can do `if ($x eq $u) ...`, presuming that `$u` is your unicode string (with the utf8 flag on!).
Update: another way would be: `use Encode; my $x = 'st\x{f9}'; my $iso = eval '"'.$x.'"'; $x = Encode::decode('iso-8859-1', $iso);` but I would only use `eval` on an arbitrary string if I could be absolutely sure it doesn't contain anything malicious. Note that the `decode()` is required if your `\x{...}` values are smaller than `0x100`, in which case they will not be unicode after the `eval`.
Update 2: actually, my latter statement is not 100% correct (i.e., the `decode` is not strictly required here, but it doesn't do any harm either). The reason is that in a number of cases (like the one here with char `0xf9`), Perl automatically upgrades the ISO-Latin-1 string. In other words, when comparing an ISO-Latin-1 (byte) string with the corresponding real unicode string, you do in fact get the (naively) expected result that they're equal.
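The behaviour described in Update 2 can be checked with a small sketch. It assumes only core Perl (`utf8::upgrade` and `utf8::is_utf8` are available without loading any module); the variable names are illustrative, not taken from the original question.

```perl
use strict;
use warnings;

my $bytes = "st\x{f9}";   # code point below 0x100: stored as bytes, utf8 flag off
my $chars = $bytes;
utf8::upgrade($chars);    # same characters, but stored with the utf8 flag on

# Record both flags and the result of a plain string comparison.
my @report = (
    utf8::is_utf8($bytes) ? 1 : 0,
    utf8::is_utf8($chars) ? 1 : 0,
    $bytes eq $chars      ? 1 : 0,
);
print "flag(bytes)=$report[0] flag(chars)=$report[1] equal=$report[2]\n";
```

The two strings differ in their internal representation, yet `eq` compares them character by character, which is why the extra `decode()` makes no difference for this particular input.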
Re: string comparison with escape sequences by kyle (Abbot) on Aug 30, 2007 at 18:44 UTC
I think the problem is with identifying whether a given series of bytes is utf8 or not. It's not hard to take a string with escape sequences and turn it into the corresponding bytes. Consider: `use Test::More 'no_plan'; my $escaped = 'st\x{f9}'; my $bytes = "st\x{f9}"; ok( $escaped ne $bytes, '$escaped ne $bytes' ); my $unescaped = $escaped; $unescaped =~ s{ \\x\{ ([a-f0-9]{2}) \} }{ chr hex $1 }xmseg; is( $unescaped, $bytes, '$unescaped eq $bytes' );`
My question is, what do you do with a string like `"\\x{f9}\x{263a}"`? It contains a utf8 character, and it also contains what looks like an escape sequence. It must be a utf8 string, so it should not be unescaped. Now imagine we chop off that last utf8 character. It's still a utf8 string, but all you see is an escape sequence. In short, if you don't know the character encoding of the bytes you receive, it's hard to interpret them.
All that having been said, I doubt a user is going to feed you anything that matches `/\\x\{[a-f0-9]{2}\}/` as a literal string, so it's pretty safe to assume that if that's in the string somewhere, you need to unescape it. Even so, I can't recommend roaming through a stream of bits, whose character encoding you don't know, changing pieces of it, and calling that some kind of progress.
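To make the ambiguity concrete, here is a minimal sketch of what a blind substitution does to the `"\\x{f9}\x{263a}"` example. The variable names are hypothetical, and the regex is a compact form of the one in the reply above.

```perl
use strict;
use warnings;

# Six literal characters (\ x { f 9 }) followed by one real character, U+263A.
my $mixed  = "\\x{f9}\x{263a}";
my $before = length $mixed;

# Unescape without knowing whether the text was meant literally.
(my $changed = $mixed) =~ s/\\x\{([a-f0-9]{2})\}/chr hex $1/eg;
my $after = length $changed;

print "before=$before after=$after\n";
```

The string shrinks from seven characters to two. Whether that is a repair or a corruption depends entirely on what the sender meant, which the code has no way of knowing.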
Re: string comparison with escape sequences by dah (Sexton) on Aug 30, 2007 at 18:20 UTC
Unescape the escape sequences, and then compare.
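One way to read "unescape, then compare" is the sketch below. `unescape_hex` is a hypothetical helper name, and the sketch assumes that the only escapes present are of the form `\x{...}` with two to four hex digits and that every such sequence really is an escape.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each literal \x{HH} or \x{HHHH} with its character.
sub unescape_hex {
    my ($s) = @_;
    $s =~ s/\\x\{([0-9a-fA-F]{2,4})\}/chr(hex($1))/ge;
    return $s;
}

my $escaped = 'st\x{f9}';   # single quotes: the backslash is literal
my $unicode = "st\x{f9}";
utf8::upgrade($unicode);    # stand-in for a string that arrived with the utf8 flag on

my $same = unescape_hex($escaped) eq $unicode ? 1 : 0;
print $same ? "equal\n" : "different\n";
```

The comparison is done on the unescaped copy, so the original input stays untouched in case the assumption about escapes turns out to be wrong.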