isync has asked for the wisdom of the Perl Monks concerning the following question:

I am writing a small proxy here and have everything up and working except one thing: transcoding to utf-8. (Don't ask why this is needed...)

Each time I try to get utf-8 right, I get lost in all those perl versions with and without utf-8 support, in the pack() trick (which does not work for me), in checking the perl utf8 bits and in reading the Encode documentation.

And each time I end up with long lines of code that do not work. But I sense that there is an easy solution. Nobody else seems to have such a headache with utf-8. So:

What is the right "any-encoding - to - utf-8" three-liner I am missing?

Any hints to the right module from CPAN are welcome (and maybe a short example so I finally use it right).


One more question:
Does LWP::UserAgent's
$response->decoded_content(default_charset => 'utf-8')
always return utf-8?

Replies are listed 'Best First'.
Re: How do I convert any given html to utf-8?
by Errto (Vicar) on Apr 24, 2007 at 00:56 UTC

    From reading the documentation it would appear that LWP will indeed return a proper Perl string with properly decoded text if the correct Content-type header is present in the response. If the pages you are downloading are not in UTF-8 and they do not contain a Content-type header specifying what encoding they are in, then

    $response->decoded_content(default_charset => 'utf-8')
    will not work and you will need to use some method to guess the correct encoding. But if those pages do have proper headers and/or you can otherwise assume they are in UTF-8, then yes that should work.
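
    To illustrate the point above, here is a minimal sketch that builds a hypothetical response by hand (the body and header are made up for the example) instead of fetching a real page. As far as I understand the HTTP::Message documentation, decoded_content returns Perl characters, not UTF-8 octets, so an explicit encode step is still needed for output:

```perl
use strict;
use warnings;
use HTTP::Response;
use Encode qw(encode);

# Hypothetical response: Latin-1 body, charset declared in the header.
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=iso-8859-1' ],
    "<p>caf\xE9</p>",
);

# The declared charset is used for decoding; default_charset only
# applies when the response does not declare one.
my $text = $response->decoded_content(default_charset => 'utf-8');
die "could not decode content" unless defined $text;

# $text now holds Perl characters. Encode explicitly to get UTF-8 octets.
my $utf8 = encode('UTF-8', $text);
```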

    There is a one-line "any-encoding - to - utf-8" conversion in Perl, but it requires you to know what encoding you're starting with. The function to use is the decode function in Encode.
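
    A minimal sketch of that conversion, assuming the input is known to be ISO-8859-1 (the sample string and variable names are only for illustration): decode turns the octets into Perl characters, and encode turns them back into UTF-8 octets.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the input is known to be ISO-8859-1.
my $source_encoding = 'iso-8859-1';
my $bytes = "caf\xE9";                         # "cafe" with e-acute, as Latin-1 octets

my $text = decode($source_encoding, $bytes);   # octets -> Perl characters
my $utf8 = encode('UTF-8', $text);             # characters -> UTF-8 octets
```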

Re: How do I convert any given html to utf-8?
by Tobiwan (Beadle) on Apr 23, 2007 at 21:19 UTC
    Hi, it is always a headache when you handle content without knowing the correct encoding. The module Encode::Guess will help you a little bit, but the interface is horrible. If it matches a charset, it returns an object; if it is not sure, it returns a string like "iso-8859-1 or iso-8859-15". Argl!
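
    One way to tame that interface is to test the result with ref(). The sketch below wraps guess_encoding in a hypothetical helper (confident_guess is my own name, not part of the module) that returns the encoding object only when the guess is unambiguous:

```perl
use strict;
use warnings;
use Encode::Guess;   # exports guess_encoding()

# Hypothetical helper: return an encoding object on a confident guess,
# or undef when the guess fails or is ambiguous.
sub confident_guess {
    my ($octets, @suspects) = @_;
    my $guess = guess_encoding($octets, @suspects);
    return ref($guess) ? $guess : undef;   # a plain string means "not sure"
}

my $octets = "caf\xC3\xA9";               # "cafe" with e-acute, as UTF-8 octets
my $enc    = confident_guess($octets);   # default suspects only
my $text   = $enc ? $enc->decode($octets) : undef;
```

    Extra suspects such as 'iso-8859-1' can be passed after the data, but single-byte encodings accept almost any input, so they tend to make the result ambiguous.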

    Try to get the encoding by any other way than guessing. Read the HTTP header, or check whether there is an encoding meta tag in the HTML head. I have worked with many charsets over the years, and many things went wrong until I started to strictly obtain the encoding information separately.
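
    A rough sketch of that lookup, using plain regexes and a hypothetical helper name (declared_charset). It assumes you already have the Content-Type header value and the raw HTML at hand; a real HTML parser would be more robust than the meta-tag regex:

```perl
use strict;
use warnings;

# Hypothetical helper: look for a declared charset, header first, then <meta>.
sub declared_charset {
    my ($content_type, $html) = @_;

    # e.g. "text/html; charset=ISO-8859-1"
    if (defined $content_type
        && $content_type =~ /charset\s*=\s*["']?([\w.:-]+)/i) {
        return lc $1;
    }

    # Covers <meta charset="..."> and
    # <meta http-equiv="Content-Type" content="text/html; charset=...">
    if (defined $html
        && $html =~ /<meta\b[^>]*\bcharset\s*=\s*["']?([\w.:-]+)/i) {
        return lc $1;
    }

    return;   # nothing declared; the caller must fall back to a default or a guess
}

my $from_header = declared_charset('text/html; charset=ISO-8859-1', '');
my $from_meta   = declared_charset('text/html', '<meta charset="UTF-8">');
my $none        = declared_charset('text/html', '<p>no declaration</p>');
```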

    To transform the data from one charset to another, the Encode module will help with things like this:

    use Encode qw(from_to);
    from_to($content, "iso-8859-1", "utf8");  # converts $content in place
      Actually I already looked into Encode::Guess, but I couldn't believe that either this (guessing) or iterating step by step through the HTTP header, meta tags etc. was the solution. Both ways (the first insecure, the second tedious) looked awful.

      Is the second option really the only option to get it 98% right?

      Uh, oh. I just remembered that there is this HTML::Parser bug with utf8 as well...

      BTW: Still, any hints on what LWP::UserAgent's decoded_content() actually does other than just handling gzip compression silently?