in reply to How do I convert any given html to utf-8?

Hi, it's always a headache to handle content without knowing the correct encoding. The module Encode::Guess will help you a little bit, but the interface is horrible. If it matches a charset, it returns an object; if it's not sure, it returns a string like "iso-8859-1 or iso-8859-15". Argl!
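To make the object-or-string behaviour concrete, here is a minimal sketch of a wrapper that hides it. The helper name guess_charset is made up for illustration, and treating every non-object result as "unknown" is my assumption about what you would want, not something Encode::Guess prescribes.

```perl
use strict;
use warnings;
use Encode;
use Encode::Guess;    # exports guess_encoding()

# Hypothetical helper: returns the name of the guessed encoding,
# or undef if the guess failed or was ambiguous.
sub guess_charset {
    my ($bytes, @suspects) = @_;

    # On success guess_encoding() returns an encoding object;
    # otherwise it returns a plain string describing the problem,
    # e.g. "iso-8859-1 or iso-8859-15".
    my $result = guess_encoding($bytes, @suspects);

    return ref($result) ? $result->name : undef;
}
```

Note that single-byte charsets such as iso-8859-1 and iso-8859-15 accept almost any byte sequence, so listing both as suspects will usually end in the ambiguous case.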

Try to get the encoding by any means other than guessing. Read the HTTP header, or check whether there is an encoding meta tag in the HTML head. I have worked with many charsets over the years, and a lot of things went wrong until I started strictly obtaining the encoding information separately.
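A rough sketch of that lookup order, assuming you already have the Content-Type header value and the raw HTML as strings. The function names are made up for illustration, and the regexes are deliberately simple; they are not a full HTML or HTTP parser.

```perl
use strict;
use warnings;

# Hypothetical helper: charset from a Content-Type header value, or undef.
sub charset_from_header {
    my ($content_type) = @_;
    return undef unless defined $content_type;
    return $content_type =~ /;\s*charset\s*=\s*"?([\w.:-]+)"?/i ? lc($1) : undef;
}

# Hypothetical helper: charset from a <meta> tag near the top of the
# document, or undef. Only the first 2048 bytes are searched (an
# arbitrary limit chosen for this sketch).
sub charset_from_meta {
    my ($html) = @_;
    return undef unless defined $html;
    my $head = substr($html, 0, 2048);
    return $head =~ /<meta[^>]+charset\s*=\s*["']?([\w.:-]+)/i ? lc($1) : undef;
}

# Prefer the HTTP header, fall back to the meta tag.
sub declared_charset {
    my ($content_type, $html) = @_;
    my $charset = charset_from_header($content_type);
    $charset = charset_from_meta($html) unless defined $charset;
    return $charset;
}
```

If both come back undef, guessing is all that is left.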

To transform the data from one charset to another, the Encode module will help with things like this:

use Encode qw(from_to);   # from_to is not exported by default
from_to($content, "iso-8859-1", "utf8");   # converts $content in place
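If you would rather not modify the data in place, the same conversion can be written as an explicit decode followed by an encode. This is only a sketch; the helper name to_utf8 is made up, and it assumes the source charset is already known.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes bytes in $charset, returns UTF-8 bytes.
sub to_utf8 {
    my ($bytes, $charset) = @_;
    my $text = decode($charset, $bytes);   # bytes -> Perl character string
    return encode('UTF-8', $text);         # characters -> UTF-8 bytes
}

# Usage: my $utf8_html = to_utf8($content, 'iso-8859-1');
```

The intermediate $text is a normal Perl character string, which is also the form you want if you do any string processing before writing the data out again.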

Re^2: How do I convert any given html to utf-8?
by isync (Hermit) on Apr 23, 2007 at 21:47 UTC
    Actually I already looked into Encode::Guess, but I couldn't believe that either this (guessing) or iterating step by step through the HTTP header, meta tags etc. was the solution. Both ways (the first unreliable, the second tedious) looked awful.

    Is the second option really the only option to get it 98% right?

    Uh, oh. I just remembered that there is this HTML::Parser bug with utf8 as well...

    BTW: Still, any hints on what LWP::UserAgent's decoded_content() actually does other than just handling gzip compression silently?