cormanaz has asked for the wisdom of the Perl Monks concerning the following question:

Howdy bros. I'm having a problem trying to parse an RSS feed using XML::FeedPP. The feed is Al Arabiya English, which looks like a normal feed when opened in a browser. I am getting it with XML::FeedPP like so:
$feedurl = 'http://www.alarabiya.net/rss/en_meast.xml';
eval { $feed = XML::FeedPP->new( $feedurl ) };
if ($@) {
    print LOG "\tFeed Error: $@\n";
}
When this executes STDERR gives a message that says:
Invalid string: [\xEF\xBF\xBD\x788\x4D\xEF\xBF\xBD\xEF\xBF\xBD\xEF\xBF +\xBD\xEF\xBF\xBD\xEF\xBF\xBD\xEF\xBF\xBD\xEF\xBF\xBD\x3F\xEF\xBF\xBD\ +xEF\xBF\xBD\x5D\xEF\xBF\xBD\xEF\xBF\xBD\x03\x55\x73\xEF\xBF\xBD\xEF\x +BF\xBD\xCC\x8D\x15\x73\xEF\xBF\xBD\xEF\xBF\xBD\xEF\xBF\xBD\xEF\xBF\xB +D\xEF\xBF\xBD\x10\x1A\xEF\xBF\xBD\x00'\x5B\xEF\xBF\xBD\x0C\x76\x1D\x7 +4\xEF\xBF\xBD\xEF\xBF\xBD\x64\x78\x71\xEF\xBF\xBD\x13\x60\xDC\xBB\xEF +\xBF\xBD\xEF\xBF\xBD\xEF\xBF\xBD(\x60\xEF\xBF\xBD\x5C\x65\xEF\xBF\xBD +\xEF\xBF\xBD=\x67\xEF\xBF\xBD\x0F\x16\xEF\xBF\xBD\x13\xEF\xBF\xBD&\x6 +7\xEF\xBF\xBD\xEF\xBF\xBD\x52\x69\x15!\x170\x16\x1E\x58\xEF\xBF\xBD\x +60\xEF\xBF\xBD\x6B%\xEF\xBF\xBD(\xEF\xBF\xBD\xEF\xBF\xBD\xEF\xBF\xBD\ +xEF\xBF\xBD\x0F\x55\xEF\xBF\xBD\x04\x6C\x19\x15\x73\xEF\xBF\xBD\x57\x +4A\xEF\xBF\xBD\x6C\x75\xEF\xBF\xBD2\xEF\xBF\xBD\xEF\xBF\xBD\xEF\xBF\x +BD\x0B'\x74\xCD\x8E\x5E\xEF\xBF\xBD\x57\xEF\xBF\xBD\xEF\xBF\xBD\x48\x +EF\xBF\xBD\x72\x4C\x6B\x55\x63,\x01\x04\x10\xEF\xBF\xBD\xEF\xBF\xBD\x +0L�3\xEF\xBF\xBD\x7B\x77\x05\xEF\xBF\xBD\xDD\xBB\xEF\xBF\xBD\x11\x7 +3\xEF\xBF\xBD\x79\xEF\xBF\xBD\x46] before <S �=�68�]¿½e�^��*���¿½7ï2B�Ì��簡$3�35*y> at / +home2/alkisahi/perl/usr/lib/perl5/site_perl/5.8.8/XML/FeedPP.pm line +521
Actually, it gives about five such invalid string errors.

The LOG written in the if statement above says:

Feed Error: Invalid feed format: %�:�q���J��DB"b�`
Then the whole script crashes.

This is puzzling in many ways. First, I don't get why the whole script crashes, given that the call is being made inside an eval block. Second, I don't know where these invalid strings are coming from because like I said the feed looks pretty normal to me. Third, XML::FeedPP is crashing on line 521, which is the first line of the following:

if ( $method eq 'url' ) {
    $tree = $tpp->parsehttp( GET => $source );
}
which doesn't even seem to be evaluating a string from the feed. :-/

Does anyone have any idea what is going on here or how I can debug?

TIA...Steve

Re: XML::FeedPP Crashing Despite Eval
by davorg (Chancellor) on Jun 25, 2009 at 08:36 UTC
    $ HEAD http://www.alarabiya.net/rss/en_meast.xml
    200 OK
    Cache-Control: max-age=300, must-revalidate
    Connection: close
    Date: Thu, 25 Jun 2009 08:32:04 GMT
    Via: 1.1 12.120.13.61:80 (cache/2.6.2.1.2.ATT)
    Accept-Ranges: bytes
    Age: 147
    ETag: "361876a-2ada-f61b9ec0"
    Server: Apache
    Vary: Accept-Encoding
    Content-Encoding: gzip
    Content-Length: 3778
    Content-Type: text/xml
    Expires: Thu, 25 Jun 2009 08:37:04 GMT
    Last-Modified: Thu, 25 Jun 2009 08:25:07 GMT
    Client-Date: Thu, 25 Jun 2009 13:31:46 GMT
    Client-Peer: 12.120.13.36:80
    Client-Response-Num: 1
    X-Cache: HIT from 12.120.13.61

    Looks like the data that you're getting back is gzipped (note the Content-Encoding: gzip header). You'll need to decompress it before processing it.
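
    As a rough illustration of that step, here is a minimal sketch that decompresses a gzipped body held in a string, using only the core IO::Compress::Gzip and IO::Uncompress::Gunzip modules. The helper name maybe_gunzip and the tiny sample document are made up for the example; they are not part of XML::FeedPP or of the real feed.

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip   $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Hypothetical helper: return the body unchanged unless it starts with
# the gzip magic bytes (0x1f 0x8b), in which case decompress it first.
sub maybe_gunzip {
    my ($body) = @_;
    return $body unless substr( $body, 0, 2 ) eq "\x1f\x8b";

    my $plain;
    gunzip( \$body => \$plain )
        or die "gunzip failed: $GunzipError\n";
    return $plain;
}

# Stand-in for the feed: a made-up RSS document, gzipped in memory.
my $xml = qq{<?xml version="1.0"?><rss version="2.0"><channel><title>demo</title></channel></rss>};
my $compressed;
gzip( \$xml => \$compressed ) or die "gzip failed: $GzipError\n";

my $from_gzip  = maybe_gunzip($compressed);   # was gzipped, gets decompressed
my $from_plain = maybe_gunzip($xml);          # was plain, passes through

print "round trip ok\n" if $from_gzip eq $xml && $from_plain eq $xml;
```

    The XML::FeedPP documentation describes new() as accepting an XML source string as well as a URL or file name, so the decompressed string could in principle be handed to it directly; treat that as something to verify against your installed version.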

    --

    See the Copyright notice on my home node.

    Perl training courses

      Thanks! But WTF? I've never heard of a feed being gzipped. Does anyone know if this is common? Also why didn't eval trap the error?

        Zipping responses from web servers is pretty common. Browsers handle zipped content seamlessly.


        eval traps die. It doesn't stop warnings, exit, ...

        You get back gzip compressed content because ...

        1. The server is able to compress the requested content on-the-fly (typically, because gzip_cnc or mod_gzip are installed), and
        2. your user-agent said that it could handle gzip-compressed content ("Accept-Encoding" header)

        or

        1. The server is able to compress the requested content on-the-fly, and
        2. the server does not care about the capabilities of the user-agent

        In the first case, make sure that the announced and the real capabilities of your user-agent match, i.e. don't send "Accept-Encoding: gzip" if you can't or don't want to handle gzip-compressed content.
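
        For the first case, one way to announce that you do not want compressed content is to send "Accept-Encoding: identity" with the request. The sketch below only builds the request object with HTTP::Request (part of libwww-perl) and prints it; it does not contact the server, and whether this particular server honours the header is an open question.

```perl
use strict;
use warnings;
use HTTP::Request;

my $feedurl = 'http://www.alarabiya.net/rss/en_meast.xml';

# "identity" tells the server that no content coding (such as gzip)
# is wanted for the response body.
my $request = HTTP::Request->new(
    GET => $feedurl,
    [ 'Accept-Encoding' => 'identity' ],
);

# Show the request line and headers that would be sent.
print $request->as_string;
```

        The request could then be passed to an LWP::UserAgent object via its request() method. A server that ignores the user-agent's capabilities will still send gzip, so this is no substitute for decoding the response.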

        In the latter case, the server is misconfigured, either accidentally or intentionally. All currently and commonly used browsers can handle gzip-compressed content, so not checking the capabilities at all saves a few thousand CPU cycles and some lines of code. Of course, this is not what I would call a well-behaved server.

        See also http://schroepl.net/projekte/mod_gzip/encoding.htm

        Your eval "surprise" was already explained: eval traps die, nothing less, nothing more. It does not help you with code that exits or messes up perl's internal structures.
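
        A small sketch of that behaviour, using only core Perl. The sub noisy_parse and its messages are invented stand-ins for a parser that warns and then dies; they are not XML::FeedPP's actual code.

```perl
use strict;
use warnings;

# Collect warnings in an array instead of letting them reach STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# Made-up stand-in for a parser that warns about bad input, then dies.
sub noisy_parse {
    warn "Invalid string: made-up example\n";
    die  "Invalid feed format: made-up example\n";
}

# The trailing 1 makes eval return true on success, undef if it died.
my $result = eval { noisy_parse(); 1 };
my $error  = $@;

print "eval returned: ", ( defined $result ? $result : 'undef' ), "\n";
print "trapped by eval: $error";
print "passed straight through eval: $_" for @warnings;
```

        The die ends up in $@, while the warning is delivered to the warn handler (or STDERR, without one) as if the eval were not there. A call to exit inside the block would likewise end the program rather than be trapped.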

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: XML::FeedPP Crashing Despite Eval
by ikegami (Patriarch) on Jun 25, 2009 at 16:04 UTC

    In parsehttp_lwp in XML::TreePP, change

    my $text = $res->content();

    to

    # Use decoded_content to handle HTTP Content-Encoding
    my $text = $res->decoded_content( charset => 'none' );

    I submitted this as rt://47336. Calling $response->content is almost always a bug.
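
    To see the difference between the two calls without touching the network, here is a minimal sketch that builds a fake gzip-encoded HTTP::Response in memory. The sample XML is made up, and the response only imitates the Content-Type and Content-Encoding headers from the HEAD output above.

```perl
use strict;
use warnings;
use HTTP::Response;
use IO::Compress::Gzip qw(gzip $GzipError);

# Made-up stand-in for the feed body.
my $xml = qq{<?xml version="1.0"?><rss version="2.0"><channel><title>demo</title></channel></rss>};
my $gzipped;
gzip( \$xml => \$gzipped ) or die "gzip failed: $GzipError\n";

# Fake response carrying a gzip-compressed body.
my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/xml', 'Content-Encoding' => 'gzip' ],
    $gzipped,
);

# content() returns the bytes exactly as sent: still compressed.
my $raw = $res->content;

# decoded_content() undoes the Content-Encoding; charset => 'none'
# skips character-set decoding so the result is still raw bytes.
my $decoded = $res->decoded_content( charset => 'none' );

print "raw body is still gzip\n"      if substr( $raw, 0, 2 ) eq "\x1f\x8b";
print "decoded body is the XML\n"     if defined $decoded && $decoded eq $xml;
```

    Passing charset => 'none' leaves the decision about character encoding to the XML parser, which can read it from the XML declaration.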

      Yup, that took care of it. Many thanks.