Conversion of Extended Characters

aceofspace has asked for the wisdom of the Perl Monks concerning the following question:

Re: Conversion of Extended Characters by moritz (Cardinal) on Dec 28, 2010 at 16:44 UTC
See Percent encoding URIs in Perl, which discusses several possibilities. Perl 6 - second systems done right
Re^2: Conversion of Extended Characters by aceofspace (Acolyte) on Dec 29, 2010 at 08:22 UTC
I've tried URI::Encode. It does not work for Extended Characters. I need something that will work for Extended Characters. Any suggestions?
Re^3: Conversion of Extended Characters by moritz (Cardinal) on Dec 29, 2010 at 08:44 UTC
The article I linked to shows how several modules handle non-ASCII characters (which I believe are what you call "extended", though I have no idea why). Is there any reason you ignored it, or haven't you read it thoroughly? Perl 6 - second systems done right
Re: Conversion of Extended Characters by JavaFan (Canon) on Dec 28, 2010 at 16:38 UTC
`use CGI;` If you then use the `param` method, all the unescaping will be done for you.
Re^2: Conversion of Extended Characters by ikegami (Patriarch) on Dec 28, 2010 at 23:28 UTC
CGI removes the URL-encoding, but IIRC, CGI leaves the character encoding in place. If so, you can use Encode's `decode_utf8` or utf8's `decode` on what `param` returns.
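The two steps described above (first undo the URL-encoding, then decode the character encoding) can be sketched without CGI. This is only an illustration: `percent_decode` is a hypothetical helper written for this example, not part of any module, and the sketch assumes the browser sent the form data as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: undo percent-encoding, yielding raw bytes.
sub percent_decode {
    my ($s) = @_;
    $s =~ tr/+/ /;                               # form encoding uses '+' for a space
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;   # each %XX becomes one byte
    return $s;
}

my $raw   = 'caf%C3%A9';              # "café" as sent by a UTF-8 form (assumed)
my $bytes = percent_decode($raw);     # byte string: c a f 0xC3 0xA9
my $text  = decode('UTF-8', $bytes);  # character string ending in U+00E9

printf "%d bytes, %d characters\n", length($bytes), length($text);
# prints "5 bytes, 4 characters"
```

With CGI, `param` already performs the first step, so only the `decode` call would remain.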
Re^3: Conversion of Extended Characters by afoken (Chancellor) on Dec 29, 2010 at 13:03 UTC
CGI has the `-utf8` pragma: This makes CGI.pm treat all parameters as UTF-8 strings. Use this with care, as it will interfere with the processing of binary uploads. It is better to manually select which fields are expected to return utf-8 strings and convert them using code like this: `use Encode; my $arg = decode utf8=>param('foo');` The problem with binary (file) uploads is due to CGI's legacy, as it partially treats file uploads as form parameters instead of keeping both separate. Alexander -- Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: Conversion of Extended Characters by andal (Hermit) on Dec 29, 2010 at 09:40 UTC
Actually, your problem has two sides. The first is simple: the non-ASCII bytes from the input form are encoded as %XX. This is taken care of by a simple substitution, by the CGI module, or by whatever else you use. The second side is the character encoding that was used for the input. In other words, you have to know the correspondence between the byte sequences and the characters. Usually this information is available from the headers. Once you have it, interpreting the byte sequence as characters is a matter of applying the appropriate conversion. If the output page uses the same encoding as the input page, no conversion is needed. If the encodings don't match, you can use Encode::from_to to convert the input into the encoding desired for the output.
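As a rough sketch of that last case, assume the input arrived as ISO-8859-1 and the output page is UTF-8 (both encodings are picked here only for illustration; the real ones have to come from the headers or the form's page). `Encode::from_to` converts the byte string in place:

```perl
use strict;
use warnings;
use Encode qw(from_to);

my $octets = "caf\xE9";   # "café" as ISO-8859-1 bytes (assumed input encoding)

# from_to rewrites $octets in place and returns the new length in bytes,
# or undef if the conversion failed.
my $len = from_to($octets, 'ISO-8859-1', 'UTF-8');

printf "%d bytes: %s\n", $len,
    join ' ', map { sprintf '%02X', ord } split //, $octets;
# prints "5 bytes: 63 61 66 C3 A9"
```

Note that the result is still a byte string, ready to be printed to a page declared as UTF-8; it is not a decoded character string.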