How do I convert any given html to utf-8?

isync has asked for the wisdom of the Perl Monks concerning the following question:

I am writing a small proxy here and have everything up and working except one thing: transcoding to utf-8. (Don't ask why this is needed...)

Each time I try to get utf-8 right, I get lost in all those perl versions with and without utf-8 support, in the pack() trick (which does not work for me), in checking the perl utf8 bits and in reading the Encode documentation.

And each time I end up with long lines of code that do not work. But I sense that there is an easy solution. Nobody else seems to have such a headache with utf-8. So:

What is the right "any-encoding - to - utf-8" three-liner I am missing?

Any hints about the right module from CPAN are welcome (and maybe a short example so I can finally use it right).

One more question:
Does LWP::UserAgent's

$response->decoded_content(default_charset => 'utf-8')

always return utf-8?

Replies are listed 'Best First'.
Re: How do I convert any given html to utf-8? by Errto (Vicar) on Apr 24, 2007 at 00:56 UTC
From reading the documentation, it would appear that LWP will indeed return a proper Perl string with properly decoded text if the correct Content-type header is present in the response. If the pages you are downloading are not in UTF-8 and they do not contain a Content-type header specifying what encoding they are in, then `$response->decoded_content(default_charset => 'utf-8')` will not work and you will need to use some method to guess the correct encoding. But if those pages do have proper headers and/or you can otherwise assume they are in UTF-8, then yes, that should work. There is a one-line "any-encoding - to - utf-8" conversion in Perl, but it requires you to know what encoding you're starting with. The function to use is the `decode` function in Encode.
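A minimal sketch of that conversion, assuming the source encoding is already known. The helper name `to_utf8` and the sample bytes are made up for illustration. Note that `decode` on its own yields a Perl character string; `encode` is the step that turns it back into UTF-8 bytes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert raw bytes in a known encoding to UTF-8 bytes.
sub to_utf8 {
    my ($bytes, $charset) = @_;
    # decode() turns bytes into a Perl character string;
    # FB_CROAK makes it die on malformed input, LEAVE_SRC keeps $bytes intact.
    my $chars = decode($charset, $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    # encode() turns the character string into UTF-8 bytes.
    return encode('UTF-8', $chars);
}

my $latin1 = "caf\xE9";                        # "café" as ISO-8859-1 bytes
my $utf8   = to_utf8($latin1, 'iso-8859-1');   # now UTF-8 bytes
printf "%d bytes in, %d bytes out\n", length $latin1, length $utf8;
```

With LWP, the string returned by `decoded_content` is already at the "character string" stage, so only the `encode('UTF-8', ...)` step would remain.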
Re: How do I convert any given html to utf-8? by Tobiwan (Beadle) on Apr 23, 2007 at 21:19 UTC
Hi, it's always a headache when you handle content without knowing the correct encoding. The module Encode::Guess will help you a little bit, but the interface is horrible. If it matches a charset, it gives you an object; if it's not sure, it delivers a string like "iso-8859-1 or iso-8859-15". Argl! Try to get the encoding by any way other than guessing. Read the HTTP header, or is there an encoding tag in the HTML head? I have worked with many charsets for many years, and many things went wrong until I started strictly obtaining the encoding information separately. To transform the data from one charset to another, the Encode module will help with things like this: `from_to($content, "iso-8859-1", "utf8");`
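A rough sketch of coping with that Encode::Guess interface, assuming you can supply a list of suspect encodings. The helper name `guess_to_utf8` and the sample data are hypothetical. The key point is to check `ref` on the result before using it.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Encode::Guess;   # exports guess_encoding()

# Hypothetical helper: returns UTF-8 bytes, or dies if the guess is not unique.
sub guess_to_utf8 {
    my ($bytes, @suspects) = @_;
    my $enc = guess_encoding($bytes, @suspects);
    # On success $enc is an encoding object; otherwise it is a plain string
    # describing the problem (e.g. a list of candidates joined by " or ").
    die "Cannot guess encoding: $enc\n" unless ref $enc;
    return encode('UTF-8', $enc->decode($bytes));
}

# \xE9 followed by a space is not valid UTF-8, so only ISO-8859-1 matches.
my $utf8 = guess_to_utf8("caf\xE9 au lait", 'iso-8859-1');

# Valid UTF-8 bytes are also valid ISO-8859-1, so this guess is ambiguous.
my $ambiguous = guess_encoding("caf\xC3\xA9", 'iso-8859-1');
if (ref $ambiguous) {
    print "guessed ", $ambiguous->name, "\n";
}
else {
    print "no unique guess: $ambiguous\n";
}
```

The second call illustrates why guessing is fragile: almost any byte sequence is valid in a single-byte charset, so adding such a suspect tends to make UTF-8 input ambiguous.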
Re^2: How do I convert any given html to utf-8? by isync (Hermit) on Apr 23, 2007 at 21:47 UTC
Actually, I already looked into Encode::Guess, but I couldn't believe that either this (guessing) or iterating step by step through the HTTP header, meta tags, etc. was the solution. Both ways (the first unreliable, the second tedious) looked awful. Is the second option really the only way to get it 98% right? Uh, oh. I just remembered that there is this HTML::Parser bug with utf8 as well... BTW: Still, any hints on what LWP::UserAgent's decoded_content() actually does other than silently handling gzip compression?