Re^3: utf8 characters in tr/// or s///

And one other thing that I'm finding conflicting advice for on the internet is packing the incoming data from CGI into utf8.

I currently use:

    $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;

against the strings that come in via the web.

Now I see that there's a "U" template for unicode. But I'm after UTF8, so that doesn't quite fit, and I don't understand what the pack docs are saying about UTF-8. However, in a couple of places I've searched I've found this:

    $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
    utf8::decode($value);

which I don't really understand. I'd assumed the "C" would put everything into ASCII/ISO-8859-1 and utf8::decoding that would just produce garbage out of the special characters.

What would the monks advise?

Cheers

MattLG


Replies are listed 'Best First'.
Re^4: utf8 characters in tr/// or s/// by graff (Chancellor) on Oct 06, 2008 at 02:20 UTC
If the stuff coming in from your web clients uses the "%XX" notation for utf8 character data, then any "wide" character (one requiring more than one byte in utf8) will arrive as one "%XX" thingie per byte (e.g. a utf8 "ÿ" (U+00FF) would be "%C3%BF"). If you see that in your input, then `pack("C",...)` is the right first step: it does not pick an encoding, it just creates the byte sequence that encodes the intended utf8 character. The utf8::decode() step then gets perl to treat that byte sequence as a utf8 character.
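
To make the two steps concrete, here is a minimal sketch using the "%C3%BF" example above. The helper name `decode_param` is made up for illustration and is not from the original post; the sketch also assumes the client really did send UTF-8 bytes, which is worth checking via the return value of utf8::decode().

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch):
# percent-decode a CGI value, then decode the resulting bytes as UTF-8.
sub decode_param {
    my ($value) = @_;

    # Step 1: each %XX escape becomes one raw byte (0..255).
    # No character encoding is implied at this point.
    $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;

    # Step 2: reinterpret the byte string as UTF-8, in place.
    # utf8::decode returns false when the bytes are not valid UTF-8.
    my $ok = utf8::decode($value);

    return ($value, $ok);
}

my ($str, $ok) = decode_param("%C3%BF");

# Two bytes came in, but after decoding perl sees a single character.
printf "ok=%d length=%d char=U+%04X\n", ($ok ? 1 : 0), length($str), ord($str);
```

After step 1 alone the string holds the two bytes 0xC3 and 0xBF; it is only the utf8::decode() call that turns them into the single character U+00FF. If the input was not UTF-8 to begin with (say a lone "%FF" from a Latin-1 form), utf8::decode() returns false and you can decide how to handle it.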
