in reply to Crashing XML::LibXML by setting UserAgent

UserAgent has nothing to do with XML::LibXML; this is a problem with encodings.

Did you google? http://mail.gnome.org/archives/xml/2003-January/msg00038.html

>  You pasted in the tree substrings which were not UTF8, check the input you 
>store in the tree for proper encoding. I assume you have
>read and understood:
>    http://xmlsoft.org/encoding.html
>
>Daniel

To your snippet, I added
use Encode::Guess; die "guessing encoding ", guess_encoding($content, Encode->encodings(":all") );
and I got
UTF-32BE:Partial character at G:/Perl/lib/Encode/Guess.pm line 124.
UCS-2LE:Partial character at G:/Perl/lib/Encode/Guess.pm line 124.
UTF-32LE:Partial character at G:/Perl/lib/Encode/Guess.pm line 124.
UTF-16BE:Partial character at G:/Perl/lib/Encode/Guess.pm line 124.
UTF-16LE:Partial character at G:/Perl/lib/Encode/Guess.pm line 124.
UTF-32:Unrecognised BOM 3c534352 at G:/Perl/lib/Encode/Guess.pm line 124.
The "bad" response has a meta tag that says CHARSET=gb2312, so I did a search and saw that Encode::CN lists gb2312. I hope this helps.
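
For what it's worth, here is a rough sketch of reading the charset out of the meta tag and asking Encode what it calls that encoding. The helper name declared_encoding and the sample tag are made up for illustration, and the regex is only a loose guess at what such a tag looks like, not a real HTML parser.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper (not part of any module): pull the charset label out
# of an HTML meta tag and resolve it to the canonical name Encode uses.
sub declared_encoding {
    my ($html) = @_;

    # Loose match on "charset=..." inside a <meta ...> tag.
    my ($label) = $html =~ /<meta[^>]+charset\s*=\s*["']?([\w.:-]+)/i
        or return;

    # find_encoding() returns an encoding object, or undef for unknown labels.
    my $enc = find_encoding($label) or return;
    return $enc->name;
}

# Assumed sample tag, modelled on the CHARSET=gb2312 seen in the "bad" response.
my $sample = '<META HTTP-EQUIV="Content-Type" CONTENT="text/html; CHARSET=gb2312">';
print declared_encoding($sample), "\n";
```

Whatever name comes back is the one to hand to decode().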

update: Try clean_html(decode('euc-cn', $content ));, it will help (man this has got to say something about your debugging skills).
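
A minimal sketch of what that decode does, with no network involved. The four bytes below are a stand-in I picked for $response->content (they are "中文" in GB2312); the real page will obviously be longer.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Stand-in for $response->content: the raw bytes of two Chinese
# characters as GB2312/EUC-CN would put them on the wire.
my $content = "\xD6\xD0\xCE\xC4";

# decode() turns encoded bytes into a Perl character string ...
my $chars = decode('euc-cn', $content);

# ... and encode() turns characters back into bytes, here as UTF-8.
my $utf8 = encode('UTF-8', $chars);

printf "%d bytes in, %d characters, %d bytes out\n",
    length $content, length $chars, length $utf8;
```

The direction matters: bytes fetched from the site get decoded first, and only a decoded string should be encoded into something else.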


MJD says you can't just make shit up and expect the computer to know what you mean, retardo!
I run a Win32 PPM repository for perl 5.6x+5.8x. I take requests.
** The Third rule of perl club is a statement of fact: pod is sexy.

Re: Crashing XML::LibXML by setting UserAgent
by hacker (Priest) on May 27, 2003 at 17:09 UTC

    I'll ignore your arrogant assertion regarding my diagnostic abilities. I've done the research on google and on irc before posting the node, so take those comments elsewhere, they aren't helpful.

    A slightly more-condensed example (using Encode as you suggested) also fails.

    I'll keep debugging to find out how to work around this. The site is definitely switching on the UserAgent value sent in the request, but without relying on HTML::Entities to encode the whole lot, it doesn't like what it gets from Encode::CN here.

    use strict;
    # use LWP::Debug qw(+);
    use LWP::UserAgent;
    use XML::LibXML;
    use Encode qw/encode decode/;

    my $url      = 'http://www.cboe.com/Chinese';
    my $ua       = 'Mozilla/5.0 (en-US; rv:1.4b) Gecko/20030514';
    my $browser  = LWP::UserAgent->new( agent => "$ua");
    my $response = $browser->get($url);
    my $content  = $response->content;
    print "Cleaning $url...\n";

    # gb2312-raw also fails
    my $euc_cn = encode("euc-cn", $content);
    my $utf8   = decode("euc-cn", $euc_cn);

    clean_html($euc_cn);

    sub clean_html {
        my $input = shift;
        my $p = XML::LibXML->new(); # parser
        $p->recover(1);
        my $cleaned = $p->parse_html_string($input)->toStringHTML;
    }
      How does it fail? Works perfectly fine for me.
      Encode 1.95 LWP::UserAgent 2.003 XML::LibXML 1.54
      The UserAgent value is utterly irrelevant to the encoding problem you are experiencing with XML::LibXML.

      update: I'm on Win2000 SP3, MSWin32-x86-multi-thread-5.8, ActivePerl Build 804



        The UserAgent value is utterly irrelevant to the encoding problem you are experiencing with XML::LibXML.
        Indeed, we established that. However, I need to send the UserAgent in this case in order to get the content I need, because the site switches on it.

        Initially, I didn't see that commenting it out returned a VB error upstream on their end, inside the content, thus delivering me content, but the WRONG content, when the "proper" UserAgent isn't sent in the request. This is why I originally attributed it to a UserAgent error. That is no longer the case. It is an error of encodings, as we have now fleshed out.

        In response to your question: I had version 1.94 of Encode, and just upgraded it to 1.95, with the same results. I was also using 2.003 of LWP::UserAgent and 1.53 of XML::LibXML (which I just upgraded to 1.54 from Phish's directory; CPAN didn't seem to notice the newer version). After upgrading, the failures are still the same.

        Are you on Windows? Or a POSIX system? It fails here on 3 Linux systems and 1 FreeBSD 4.8 system, all clean installs.