user2000 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am trying to use XML::LibXML to process a document fetched using LWP. The problem is that when a document has a character encoding other than UTF-8 and I call toStringHTML, I get strange characters such as an "A" with an accent on top of it. I call setEncoding with, say, ISO-8859-1, but I do not know where the problem lies.
my $parser = XML::LibXML->new;
$parser->recover_silently(1);
my $dom = $parser->parse_html_string($htmlfromlwp);
$dom->setEncoding($encoding);
$output = $dom->toStringHTML();
Thank you, Anant

Replies are listed 'Best First'.
Re: XML::LibXML encoding problem
by shmem (Chancellor) on Sep 30, 2007 at 12:26 UTC
    From the XML::LibXML::Parser pod:
    parse_html_file
    $doc = $parser->parse_html_file( $htmlfile, \%opts );
    Similar to parse_file() but parses HTML (strict) documents; $htmlfile can be filename or URL.

    An optional second argument can be used to pass some options to the HTML parser as a HASH reference. Possible options are: encoding and URI for libxml2 < 2.6.27, and for later versions of libxml2 additionally: recover, suppress_errors, suppress_warnings, pedantic_parser, no_blanks, and no_network.

    So you probably want something like

    my $dom = $parser->parse_html_string($htmlfromlwp, { encoding => 'iso-8859-1' } );

    since setting the encoding after the fact (that is, after parsing) does not cause the document to be re-parsed.
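
    Put together, a minimal sketch of the whole flow might look like the following. The sample string and the hard-coded $encoding are assumptions standing in for the bytes and charset you would actually get from LWP; only the encoding option passed to parse_html_string comes from the pod quoted above.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical stand-in for the raw bytes fetched with LWP:
# an ISO-8859-1 page in which "\xE9" is the single byte for e-acute.
my $htmlfromlwp = "<html><body><p>caf\xE9</p></body></html>";

# Assumed to be known already, e.g. taken from the Content-Type header.
my $encoding = 'iso-8859-1';

my $parser = XML::LibXML->new;
$parser->recover_silently(1);

# Declare the input encoding at parse time instead of calling
# setEncoding() on the finished document.
my $dom = $parser->parse_html_string($htmlfromlwp, { encoding => $encoding });

# Serialize the parsed tree back to HTML.
my $output = $dom->toStringHTML();
print $output;
```

    If the charset varies from page to page, you would need to work out $encoding per response before parsing; how to do that is outside what this sketch covers.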

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}