
Crashing XML::LibXML by setting UserAgent

by hacker (Priest)
on May 27, 2003 at 13:00 UTC ( [id://260981]=perlquestion )

hacker has asked for the wisdom of the Perl Monks concerning the following question:

I just found that, on very specific web pages parsed by my scripts, setting a UserAgent value through LWP::UserAgent seems to crash XML::LibXML, and I can't figure out why.

Here's a snippet to exhibit this behavior. Run this, then uncomment the $browser->agent($ua); call to see the crash:

use strict;
use LWP::UserAgent;
use XML::LibXML;

# Or http://www.cboe.com/Chinese/
my $url = 'http://www.cboe.com/Spanish/';
my $ua  = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)';

my $browser = LWP::UserAgent->new();

# Uncomment this line to crash libxml's parser
# $browser->agent($ua);

my $response = $browser->get($url);
my $content  = $response->content;
my $type     = $response->content_type;

print "Cleaning $url ($type)...\n";
clean_html($content);

sub clean_html {
    my $input  = shift;
    my $parser = XML::LibXML->new();
    $parser->recover(1);
    my $cleaned = $parser->parse_html_string($input)->toStringHTML;
}

It only happens on some pages, and I can't find a pattern in the actual HTML content itself that causes it. Here's the error I get when it dies (a few hundred of these for this particular page parsed):

output conversion failed due to conv error
Bytes: 0xE4 0xE5 0xE6 0xE8
xmlOutputBufferWrite: encoder error
output conversion failed due to conv error
Bytes: 0xE5 0xE6 0xE8 0x26
xmlOutputBufferWrite: encoder error
...

Comment out the UserAgent value, and it seems to work perfectly, without any errors. What could be causing this?

Update: Further investigation shows that the '/' in the UserAgent value is stuffing up XML::LibXML here. Changing the '/' to anything else will cause the page to work.

Why would a slash character in the UserAgent string dump libxml?

Replies are listed 'Best First'.
Re: Crashing XML::LibXML by setting UserAgent
by PodMaster (Abbot) on May 27, 2003 at 13:55 UTC
    UserAgent has nothing to do with XML::LibXML; this is a problem with encodings.

    Did you google? http://mail.gnome.org/archives/xml/2003-January/msg00038.html

    >  You pasted in the tree substrings which were not UTF8, check the input you 
    >store in the tree for proper encoding. I assume you have
    >read and understood:
    >    http://xmlsoft.org/encoding.html
    >
    >Daniel
    
    
    To your snippet, I added
    use Encode::Guess;
    die "guessing encoding ", guess_encoding($content, Encode->encodings(":all") );
    and I got
    UTF-32BE:Partial character at G:/Perl/lib/Encode/Guess.pm line 124.
    UCS-2LE:Partial character at G:/Perl/lib/Encode/Guess.pm line 124.
    UTF-32LE:Partial character at G:/Perl/lib/Encode/Guess.pm line 124.
    UTF-16BE:Partial character at G:/Perl/lib/Encode/Guess.pm line 124.
    UTF-16LE:Partial character at G:/Perl/lib/Encode/Guess.pm line 124.
    UTF-32:Unrecognised BOM 3c534352 at G:/Perl/lib/Encode/Guess.pm line 124.
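
Passing every known encoding as a suspect seems to be what produces the wall of UTF-16/UTF-32 complaints above. A minimal sketch of the same guess with a narrowed suspect list follows. The sample bytes and the `guess_name` helper are assumptions made up for illustration, not anything taken from the actual page.

```perl
use strict;
use warnings;
use Encode::Guess;    # core module since Perl 5.8

# Hypothetical stand-in for the fetched page body: the GB2312 bytes for
# the two characters U+4E2D U+6587. Not real output from the site.
my $sample = "\xD6\xD0\xCE\xC4";

# Hypothetical helper: returns the canonical encoding name when the
# guess is unambiguous, otherwise undef. guess_encoding() returns an
# object on success and a plain error string on failure, hence ref().
sub guess_name {
    my ($bytes, @suspects) = @_;
    my $guess = guess_encoding($bytes, @suspects);
    return ref($guess) ? $guess->name : undef;
}

my $name = guess_name($sample, 'euc-cn');
print defined $name ? "guessed: $name\n" : "no unambiguous guess\n";
```

Guessing is a heuristic; when the page declares its charset, that declaration is usually the better starting point.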
    
    The "bad" response has a meta tag that says CHARSET=gb2312, so I did a search and saw that Encode::CN mentions gb2312. I hope this helps.

    Update: try clean_html(decode('euc-cn', $content)); it will help (man, this has got to say something about your debugging skills).
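
A minimal sketch of what that decode step does, using only the core Encode module so it runs without network access. The byte string is a made-up stand-in for `$response->content`; in the real script the decoded characters would be what gets handed to `clean_html`.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-in for $response->content: GB2312/EUC-CN bytes for
# U+4E2D U+6587. The real page body would go here instead.
my $raw = "\xD6\xD0\xCE\xC4";

# Bytes -> Perl characters, using the charset the page declares.
my $chars = decode('euc-cn', $raw);

# Characters -> UTF-8 bytes, e.g. for writing to a file or a socket.
my $utf8 = encode('UTF-8', $chars);

printf "%d bytes in, %d characters, %d UTF-8 bytes out\n",
    length($raw), length($chars), length($utf8);
```

The point of the sketch is the direction of each call: `decode` goes from bytes to characters, `encode` goes from characters back to bytes.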


    MJD says you can't just make shit up and expect the computer to know what you mean, retardo!
    I run a Win32 PPM repository for perl 5.6x+5.8x. I take requests.
    ** The Third rule of perl club is a statement of fact: pod is sexy.

      I'll ignore your arrogant assertion regarding my diagnostic abilities. I did the research on Google and on IRC before posting the node, so take those comments elsewhere; they aren't helpful.

      A slightly more condensed example (using Encode as you suggested) also fails.

      I'll keep debugging to find out how to work around this. The site is definitely switching on the UserAgent value sent in the request, but without relying on HTML::Entities to encode the whole lot, it doesn't like what it gets from Encode::CN here.

      use strict;
      # use LWP::Debug qw(+);
      use LWP::UserAgent;
      use XML::LibXML;
      use Encode qw/encode decode/;

      my $url = 'http://www.cboe.com/Chinese';
      my $ua  = 'Mozilla/5.0 (en-US; rv:1.4b) Gecko/20030514';

      my $browser  = LWP::UserAgent->new( agent => "$ua");
      my $response = $browser->get($url);
      my $content  = $response->content;

      print "Cleaning $url...\n";

      # gb2312-raw also fails
      my $euc_cn = encode("euc-cn", $content);
      my $utf8   = decode("euc-cn", $euc_cn);

      clean_html($euc_cn);

      sub clean_html {
          my $input = shift;
          my $p = XML::LibXML->new();    # parser
          $p->recover(1);
          my $cleaned = $p->parse_html_string($input)->toStringHTML;
      }
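
One possible reading of the condensed example, offered as a sketch rather than a diagnosis: it calls `encode` on the raw response bytes before `decode`, and then passes `$euc_cn` rather than the decoded `$utf8` to `clean_html`. The snippet below uses a made-up byte string (an assumption, not the site's content) to show that an encode-then-decode round trip on raw bytes need not give the same result as decoding them directly.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical stand-in for the raw response body (EUC-CN bytes).
my $raw = "\xD6\xD0\xCE\xC4";

# What the condensed example does first: encode() treats each input
# byte as a character of its own, so the original byte pairs are lost
# before decode() ever sees them.
my $round_trip = decode('euc-cn', encode('euc-cn', $raw));

# Decoding the raw bytes directly.
my $direct = decode('euc-cn', $raw);

print $round_trip eq $direct ? "same result\n" : "different result\n";
```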
        How does it fail? Works perfectly fine for me.
        Encode 1.95
        LWP::UserAgent 2.003
        XML::LibXML 1.54
        The UserAgent value is utterly irrelevant to the encoding problem you are experiencing with XML::LibXML.

        update: I'm on Win2000 SP3, MSWin32-x86-multi-thread-5.8, ActivePerl Build 804



Re: Crashing XML::LibXML by setting UserAgent
by arturo (Vicar) on May 27, 2003 at 13:54 UTC

    Given that the default user agent string libwww-perl/#.## contains a slash, it's unlikely that it's the slash as such. My overwhelming inclination is to say that LibXML (iconv?) is not lying -- it's an encoding conversion problem, and thus it appears that what you're getting as input depends on the user agent string, such that the web SERVER is delivering different content to you depending on the contents of your user agent string. Perhaps the server recognizes the default string and delivers content accordingly, perhaps it *doesn't* recognize the ones you're setting and goes to some sort of default (if it uses some sort of regex solution, it expects the slash to precede the client's version number, for example, and it doesn't recognize "Mozilla version 5").

    The other possibility that suggests itself, which seems pretty remote to me, and is not true of the source on my installation, is that calling agent has side effects (resetting content-accept, e.g.). But as I say, that's not true of my installation and I don't believe it will turn out to be true of yours.

    So, check the encoding on the incoming content. That's my main suggestion, and it's probably a good idea anyway because you're dealing with i18n issues here.
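
A rough sketch of what checking the incoming encoding could look like, using only the core Encode module. The `declared_charset` and `is_valid_utf8` helpers and the sample page are hypothetical, invented for illustration; the regex is a heuristic, not an HTML parser.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Hypothetical helper: pull the first charset=... declaration out of
# the markup and lowercase it. Returns undef if none is found.
sub declared_charset {
    my ($html) = @_;
    return $html =~ /charset\s*=\s*["']?([\w.:-]+)/i ? lc($1) : undef;
}

# Hypothetical helper: true if the bytes decode cleanly as strict UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() may modify its argument when CHECK is set
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

# Made-up sample standing in for a response body served as GB2312.
my $page = qq{<html><head><meta http-equiv="Content-Type" }
         . qq{content="text/html; CHARSET=gb2312"></head>}
         . qq{<body>\xD6\xD0\xCE\xC4</body></html>};

my $charset = declared_charset($page);
my $enc     = defined $charset ? find_encoding($charset) : undef;

printf "declared: %s, known to Encode as: %s, valid UTF-8: %s\n",
    defined $charset ? $charset : 'none',
    $enc ? $enc->name : 'unknown',
    is_valid_utf8($page) ? 'yes' : 'no';
```

Running the same checks on the bodies returned for two different user agent strings would show whether the server really is sending different encodings.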

    If not P, what? Q maybe?
    "Sidney Morgenbesser"

Node Type: perlquestion [id://260981]
Front-paged by TStanley