.rhavin has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks;-)
After spending some night(mare)s with encoding madness, I finally got a nearly working solution for the following task:

I set up the agent (package var $XAgent) like this:

# -----------------------------------------------------
sub _setAgent {
    $XAgent = LWP::UserAgent->new(keep_alive => 1);
    $XAgent->default_header('Accept-Charset' => 'ISO-8859-1,utf-8');
    $XAgent->agent($ENV{'HTTP_USER_AGENT'});
    $XAgent->cookie_jar({});    # allow cookies
};

I get data like this:

# -----------------------------------------------------
sub _getUrl {
    return $XAgent->get(shift)->decoded_content();
};

I ask XML::Parser to parse the data like this:

# -----------------------------------------------------
my $xml = _getUrl($url);
my $p = XML::Parser->new(Style => 'Stream', Pkg => 'some_pkg', ProtocolEncoding => "utf-8");
$p->parse($xml);

If I leave out the 'utf-8' hint for XML::Parser, some non-ASCII chars get screwed up, so I thought "alright, decoded_content() returns Perl-friendly utf-8, so set that manually!" Works almost every time. Almost. So I'm not quite sure if I'm right about that assumption.

So my questions are:

Any further enlightenment and, of course, hints on how to do things better/faster are highly welcome.
TIA, ~.rhavin;)

Re: LWP::Agent vs. XML::Parser - the zillionth encoding madness question
by ikegami (Patriarch) on Jan 28, 2010 at 18:33 UTC

    XML::Parser requires an XML document. It honours BOMs and the encoding attribute. I confirmed this by testing.

    But you don't pass it an XML document. An XML document is a collection of bytes, but you decoded the XML document into characters. It's the parser's job to decode the values it returns using the encoding specified inside the document, so you need to avoid removing any character encoding. Fix:

    # Remove Content-Encoding (e.g. compression),
    # but leave document as bytes.
    my $xml = $response->decoded_content( charset => 'none' );
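The bytes-versus-characters distinction above can be illustrated without any network access. The following is only a minimal sketch using the core Encode module. The sample document and the variable names ($xml_bytes, $xml_chars) are made up for illustration and do not come from the original post.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical document as it would arrive on the wire: UTF-8 encoded bytes.
# "\xC3\xA9" is the two-byte UTF-8 encoding of U+00E9 (e with acute accent).
my $xml_bytes = qq{<?xml version="1.0" encoding="UTF-8"?><name>Andr\xC3\xA9</name>};

# What a charset-decoding step produces: a string of characters,
# in which the two bytes above have become one character.
my $xml_chars = decode('UTF-8', $xml_bytes);

# The byte string is one element longer than the character string.
printf "bytes: %d, characters: %d\n", length($xml_bytes), length($xml_chars);
```

The byte string is what the encoding declaration inside the document describes. The character string has already been decoded once, so the declaration no longer describes it.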

    Why did your code work if it was buggy? Because there's also a bug in XML::Parser::Expat. Expat incorrectly uses Perl's internal representation of the string as the XML document instead of using the contents of the string. Most of the time, your bug and this bug cancel out to produce the right output.

    Here's the workaround for the bug in XML::Parser::Expat (does nothing most of the time):

    # Expat expects the string to use this internal format.
    utf8::downgrade($xml) if $] ge '5.008';
    $p->parse($xml);
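To see what "internal representation" versus "contents of the string" means here, the sketch below uses only the built-in utf8:: functions. The sample string and the variable names are illustrative assumptions. It shows that upgrading and downgrading change how Perl stores a string but not what the string compares equal to.

```perl
use strict;
use warnings;

# Hypothetical byte string; "\xE9" is a single byte above the ASCII range.
my $original = "caf\xE9";
my $copy     = $original;

# Switch the copy's internal storage to UTF-8. The contents stay the same.
utf8::upgrade($copy);
my $flag_after_upgrade = utf8::is_utf8($copy) ? 1 : 0;
my $equal_after_upgrade = ($copy eq $original) ? 1 : 0;

# Switch the storage back to one byte per character, as in the workaround.
utf8::downgrade($copy);
my $flag_after_downgrade = utf8::is_utf8($copy) ? 1 : 0;
my $equal_after_downgrade = ($copy eq $original) ? 1 : 0;

print "upgraded:   flag=$flag_after_upgrade equal=$equal_after_upgrade\n";
print "downgraded: flag=$flag_after_downgrade equal=$equal_after_downgrade\n";
```

Note that utf8::downgrade dies if the string contains a character above 0xFF, unless a true second argument is passed. A string that really holds raw bytes never contains such a character, so for the charset => 'none' case the call should always succeed.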