in reply to Parsing badly formed RSS or XML

XML parsers are supposed to barf on XML that isn't well-formed; it's in the spec. You should shout at whoever gives you the XML until they fix their output.

In the meantime, you can make your parsing script die a little more gracefully by using eval like this:

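    use strict;
    use warnings;
    use XML::RSS;

    my $file = 'file.rss';
    my $p    = XML::RSS->new;

    # parsefile() dies on malformed XML; eval traps the exception in $@
    eval { $p->parsefile($file) };

    if ($@) {
        die "Bad XML document!!\n";
    } else {
        print "Good XML!\n";
    }
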
--
<http://www.dave.org.uk>

"Perl makes the fun jobs fun
and the boring jobs bearable" - me

Re: Re: Parsing badly formed RSS or XML
by tomhukins (Curate) on Feb 22, 2001 at 21:36 UTC

    I realise that XML parsers are supposed to reject badly formed XML, but in my limited experience, a sizeable proportion of RSS feeds are badly deployed.

    I have alerted several Webmasters to problems I've encountered, but problems aren't always fixed.

    Until recently, I used code very similar to what you have above. However, I frequently found myself missing information from badly formed RSS feeds. I can understand the benefits of rejecting badly formed XML in mission-critical situations, but for RSS feeds I'd rather misinterpret the information I'm receiving than receive no information at all. Others' opinions may differ.
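
    To illustrate the trade-off, here is a minimal sketch of a lenient extractor that could serve as a fallback when a strict parse fails. It is not the code from the original post: the helper name extract_items_leniently and the sample feed are made up for this example, and the regexes assume simple, non-nested <item> elements. It uses only core Perl.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): pull title/link pairs
# out of RSS text with regexes, without requiring well-formed XML.
sub extract_items_leniently {
    my ($rss) = @_;
    my @items;

    # Walk over each <item>...</item> block, case-insensitively.
    while ($rss =~ m{<item\b[^>]*>(.*?)</item>}gis) {
        my $body = $1;    # copy before the inner matches overwrite $1

        my ($title) = $body =~ m{<title\b[^>]*>\s*(.*?)\s*</title>}is;
        my ($link)  = $body =~ m{<link\b[^>]*>\s*(.*?)\s*</link>}is;

        # Skip items where neither field could be recovered.
        next unless defined $title || defined $link;

        push @items, { title => $title // '', link => $link // '' };
    }
    return @items;
}

# Sample feed with an unescaped ampersand, which a conforming XML parser
# has to reject.
my $bad_rss = <<'RSS';
<rss version="0.91"><channel>
<item><title>Fish & Chips</title><link>http://example.com/1</link></item>
<item><title>Second story</title><link>http://example.com/2</link></item>
</channel></rss>
RSS

my @items = extract_items_leniently($bad_rss);
print "$_->{title} => $_->{link}\n" for @items;
```

    This deliberately ignores the XML spec: entities are not decoded, CDATA sections and comments are not understood, and anything unusual in the markup may be misread. That is exactly the "misinterpret rather than receive nothing" compromise described above.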

    With hindsight, I should have written a strong disclaimer with my code that it breaks the XML spec.

      "I have alerted several Webmasters to problems I've encountered, but problems aren't always fixed."

      I wonder if you've considered a message on your page along the lines of "we would have liked to have been able to give you information from (name of website), but unfortunately their data feed that claims to be XML isn't and therefore it breaks well behaved parsers".

      If we let people get away with producing bad XML, then we're heading down a path that leads to the same sort of nightmare that we currently have with HTML.

      --
      <http://www.dave.org.uk>

      "Perl makes the fun jobs fun
      and the boring jobs bearable" - me

        "If we let people get away with producing bad XML, then we're heading down a path that leads to the same sort of nightmare that we currently have with HTML."

        Agreed. Perhaps a suitable compromise might be to attempt to identify badly formed RSS using your Good XML/Bad XML test and alert users if a site isn't outputting information properly and why this is bad. Maybe using <BLINK> tags. ;-)
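
        A rough sketch of that compromise, reusing the eval test from earlier in the thread. The helper name feed_notice, the site names and the inline sample feeds are all invented for illustration; a real script would presumably fetch the feeds first. It assumes XML::RSS is installed and relies on its parse method dying on input that is not well-formed.

```perl
use strict;
use warnings;
use XML::RSS;

# Hypothetical helper: returns nothing if the feed parses cleanly, or a
# user-facing notice if the parser rejects it.
sub feed_notice {
    my ($site, $xml) = @_;
    my $rss = XML::RSS->new;

    # The trailing 1 makes eval return true only when parse() did not die.
    return if eval { $rss->parse($xml); 1 };

    return "Sorry, no headlines from $site: "
         . "its feed claims to be XML but is not well-formed.";
}

# Made-up sample feeds; the second has an unescaped ampersand.
my %feeds = (
    'Good Site' => '<?xml version="1.0"?><rss version="0.91"><channel>'
                 . '<title>Good</title><link>http://example.com/</link>'
                 . '<description>ok</description></channel></rss>',
    'Bad Site'  => '<rss version="0.91"><channel>'
                 . '<title>Fish & Chips</title></channel></rss>',
);

for my $site (sort keys %feeds) {
    my $notice = feed_notice($site, $feeds{$site});
    print defined $notice ? "$notice\n" : "$site: feed OK\n";
}
```

        The notice text is where the "why this is bad" explanation would go.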

        At present, my application only runs on the intranet where I work, so advocacy of well-formed XML isn't such an issue.

        Anyway, I'm off to perform another violation of my beliefs and print out documentation to MS-Word. It's not for me, though. Honest!