
One other thing I'm finding conflicting advice about on the internet is how to turn the incoming data from CGI into UTF-8.

I currently use:

$value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;

against the strings that come in via the web.

Now I see that there's a "U" template for Unicode. But I'm after UTF-8, so that doesn't quite fit, and I don't understand what the pack docs are saying about UTF-8. However, in a couple of places I've searched I've found this:

$value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
utf8::decode($value);

which I don't really understand. I'd assumed the "C" would put everything into ASCII/ISO-8859-1, and that calling utf8::decode() on the result would just turn the special characters into garbage.

What would the monks advise?

Cheers

MattLG

Re^4: utf8 characters in tr/// or s///
by graff (Chancellor) on Oct 06, 2008 at 02:20 UTC
    If the stuff coming in from your web clients is using the "%XX" notation for utf8 character data, then any "wide" characters (requiring more than one byte in utf8) will require one "%XX" thingie per byte (e.g. a utf8 "ÿ" (U+00FF) would be "%C3%BF").
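To make the one-"%XX"-per-byte point concrete, here is a rough sketch of the encoding direction. The helper name percent_encode_utf8 is made up for illustration, and for simplicity it escapes every byte, including plain ASCII, which a real browser would not do.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): percent-encode every byte of
# the UTF-8 encoding of a character string.
sub percent_encode_utf8 {
    my ($chars) = @_;
    my $bytes = $chars;
    utf8::encode($bytes);    # character string -> UTF-8 byte string, in place
    # Replace each byte with its %XX form; /s lets "." match newlines too.
    $bytes =~ s/(.)/sprintf("%%%02X", ord($1))/egs;
    return $bytes;
}

my $encoded = percent_encode_utf8("\x{FF}");    # U+00FF, "ÿ"
print "$encoded\n";                             # prints %C3%BF
```

One character goes in and two "%XX" groups come out, because U+00FF takes two bytes in UTF-8.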

    If you see that in your input, then pack("C",...) is the right first step: it turns each "%XX" into a single raw byte, which reconstructs the byte sequence of the intended utf8 character. At that point perl still sees a string of separate bytes; utf8::decode() then tells perl to treat that byte sequence as utf8, so each multi-byte sequence becomes one character.
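The two steps can be tried out with a small sketch like the following. The wrapper name decode_cgi_value is only an illustration; the two statements inside it are the ones quoted in the question. It assumes the client really did send UTF-8 bytes.

```perl
use strict;
use warnings;

# Hypothetical wrapper name; the two statements inside come from the thread.
sub decode_cgi_value {
    my ($value) = @_;
    # Step 1: each %XX becomes one raw byte (so "%C3%BF" is now 2 bytes).
    $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
    # Step 2: reinterpret the byte string as UTF-8, in place, so that each
    # multi-byte sequence collapses into a single character.
    utf8::decode($value);
    return $value;
}

my $str = decode_cgi_value("%C3%BF");
printf "length=%d ord=U+%04X\n", length($str), ord($str);
# prints: length=1 ord=U+00FF
```

Without the utf8::decode() call, length($str) would be 2 and the string would hold the separate bytes 0xC3 and 0xBF, which is the "garbage" you see when those bytes are displayed as ISO-8859-1.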