in reply to Re: Crashing XML::LibXML by setting UserAgent
in thread Crashing XML::LibXML by setting UserAgent

I'll ignore your arrogant assertion regarding my diagnostic abilities. I did the research on Google and on IRC before posting the node, so take those comments elsewhere; they aren't helpful.

A slightly more condensed example (using Encode as you suggested) also fails.

I'll keep debugging to find out how to work around this. The site is definitely switching on the UserAgent value sent in the request, but without relying on HTML::Entities to encode the whole lot, it doesn't like what it gets from Encode::CN here.

use strict;
# use LWP::Debug qw(+);
use LWP::UserAgent;
use XML::LibXML;
use Encode qw/encode decode/;

my $url = 'http://www.cboe.com/Chinese';
my $ua  = 'Mozilla/5.0 (en-US; rv:1.4b) Gecko/20030514';

my $browser  = LWP::UserAgent->new( agent => "$ua");
my $response = $browser->get($url);
my $content  = $response->content;

print "Cleaning $url...\n";

# gb2312-raw also fails
my $euc_cn = encode("euc-cn", $content);
my $utf8   = decode("euc-cn", $euc_cn);

clean_html($euc_cn);

sub clean_html {
    my $input = shift;
    my $p = XML::LibXML->new(); # parser
    $p->recover(1);
    my $cleaned = $p->parse_html_string($input)->toStringHTML;
}
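
For anyone following along, here is a minimal sketch of which direction each Encode call goes. It assumes the bytes on the wire are EUC-CN, which nothing in this thread confirms, and the four sample bytes are a made-up stand-in for the real page content, not anything fetched from the site.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: the characters U+4E2D U+6587 as EUC-CN bytes.
# Stands in for what $response->content might return (an assumption).
my $bytes = "\xD6\xD0\xCE\xC4";

# decode: bytes in a named encoding -> Perl character string.
my $chars = decode('euc-cn', $bytes);

# encode: Perl character string -> bytes in a named encoding.
my $utf8_bytes = encode('UTF-8', $chars);
my $round_trip = encode('euc-cn', $chars);

printf "characters: %d, EUC-CN bytes: %d, UTF-8 bytes: %d\n",
    length $chars, length $round_trip, length $utf8_bytes;
```

If that assumption holds, the raw response body would normally go through decode first, with encode applied only to the resulting character string. Whether that changes what XML::LibXML does with this page is untested here.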

Re: Re: Crashing XML::LibXML by setting UserAgent
by PodMaster (Abbot) on May 27, 2003 at 17:35 UTC
    How does it fail? Works perfectly fine for me.
    Encode 1.95, LWP::UserAgent 2.003, XML::LibXML 1.54
    The UserAgent value is utterly irrelevant to the encoding problem you are experiencing with XML::LibXML.

    update: I'm on Win2000 SP3, MSWin32-x86-multi-thread-5.8, ActivePerl Build 804



      The UserAgent value is utterly irrelevant to the encoding problem you are experiencing with XML::LibXML.
      Indeed, we established that. However, I need to send the UserAgent in this case to get the content I need, because the site switches on it.

      Initially, I didn't notice that commenting it out produced a VB error upstream on their end, embedded in the content. I was still getting content, but the WRONG content, whenever the "proper" UserAgent wasn't sent in the request. This is why I originally attributed the failure to the UserAgent. That is no longer the case: it is an encoding error, as we have now fleshed out.

      In response to your question: I had version 1.94 of Encode and just upgraded it to 1.95, with the same results. I was also using 2.003 of LWP::UserAgent and 1.53 of XML::LibXML (which I just upgraded to 1.54 from Phish's directory; CPAN didn't seem to notice the newer version). After upgrading, the failures are still the same.
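
To make comparing setups across machines easier, something like the following could print the versions in one go. It is only a sketch: the helper name module_version is made up for illustration, and a module that isn't installed is simply reported as such.

```perl
use strict;
use warnings;

# Hypothetical helper: return the installed version of a module,
# or undef if it cannot be loaded.
sub module_version {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;   # Foo::Bar -> Foo/Bar.pm
    return eval { require $file; $module->VERSION };
}

# The three modules discussed in this thread.
for my $module (qw(Encode LWP::UserAgent XML::LibXML)) {
    my $version = module_version($module);
    printf "%-16s %s\n", $module,
        defined $version ? $version : 'not installed';
}
```

Running the same script on the Windows box and on the Linux and FreeBSD boxes would at least rule out a module version difference.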

      Are you on Windows? Or a POSIX system? It fails here on 3 Linux systems and 1 FreeBSD 4.8 system, all clean installs.