Parsing badly formed RSS or XML

tomhukins has asked for the wisdom of the Perl Monks concerning the following question:

I've written an application which parses RSS feeds and outputs a Web page so I can track changes on my favourite sites. There's nothing original or exciting about that.

I discovered that several sites I'm interested in sometimes output badly formed RSS. Special characters are often unescaped, which causes problems for XML::RSS and any other strict XML parser.

I've written a simple heuristic to parse badly formed XML, which might be of use to others. Maybe you can help me improve this?

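use strict;
use warnings;
use XML::RSS ();

my $rss_content = "RSS goes here";

# Each entry is [pattern, replacement expression]. Fixes are applied
# cumulatively, one more after each failed parse.
my @replace = (
    ['\&(\s)', '"&amp;$1"'], # '&' followed by whitespace
    ['(\s)>',  '"$1&gt;"'],  # '>' preceded by whitespace
    ['\&',     '"&amp;"'],   # All '&'
);
my($data, $rss);
# The leading undef makes the first attempt parse the untouched content,
# and ensures the content is parsed again after the last fix as well.
PARSE: for my $repl (undef, @replace) {
    $rss_content =~ s/$repl->[0]/eval($repl->[1])/ge if $repl;
    $rss = XML::RSS->new;
    $data = eval { $rss->parse($rss_content) } and last PARSE;
}
die "Unable to parse RSS, even after fixes\n" unless $data;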

I've only used this with XML::RSS, but this technique could be applied to XML::Parser, or any other XML parsing code.
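As a rough sketch of how the same retry idea might be separated from any particular parser: the helper below takes the parser as a code reference that dies on malformed input. The names `parse_with_repairs`, `$strict` and `@repairs` are made up for this illustration, and `$strict` is only a stand-in for a real parser such as XML::RSS or XML::Parser.

```perl
use strict;
use warnings;

# Hypothetical helper: try $parse on the content, applying one more repair
# after each failure. $parse must die on malformed input.
sub parse_with_repairs {
    my ($content, $parse, @repairs) = @_;
    for my $repair (undef, @repairs) {
        $content = $repair->($content) if $repair;
        my $result = eval { $parse->($content) };
        return ($result, $content) if defined $result;
    }
    return;    # every attempt failed
}

# Stand-in for a strict parser (an assumption for this sketch): it dies on
# any '&' that does not start an entity or character reference.
my $strict = sub {
    my ($xml) = @_;
    die "bare ampersand\n" if $xml =~ /&(?![A-Za-z#][A-Za-z0-9]*;)/;
    return length $xml;
};

my @repairs = (
    # '&' followed by whitespace
    sub { (my $x = shift) =~ s/&(\s)/&amp;$1/g; $x },
    # any remaining '&' that does not already start a reference
    sub { (my $x = shift) =~ s/&(?![A-Za-z#][A-Za-z0-9]*;)/&amp;/g; $x },
);

my ($ok, $fixed) = parse_with_repairs('<title>Fish & Chips</title>', $strict, @repairs);
print defined $ok ? "parsed after repair: $fixed\n" : "gave up\n";
```

With a real parser, `$parse` might be something like `sub { XML::RSS->new->parse($_[0]) }`, on the assumption that the call dies on malformed input and returns a defined value otherwise.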


Replies are listed 'Best First'.
(jcwren) Re: Parsing badly formed RSS or XML by jcwren (Prior) on Feb 22, 2001 at 21:11 UTC
I've not played with RSS, but I can tell you that the XML specification explicitly states that bad XML must cause a parser abort. If someone is outputting bad XML, you need to report it to them, not make allowances in your code. Otherwise, it's going to break for people who are not "cleaning up" their errors. While the adage "be lenient in what you accept, strict in what you give" is applicable in a large number of areas, XML parsing is not one of them. Don't tolerate badly formatted XML. Raise hell with them until they fix it. --Chris
Re: Parsing badly formed RSS or XML by davorg (Chancellor) on Feb 22, 2001 at 21:22 UTC
XML parsers are supposed to barf on non-well-formed XML - it's in the spec. You should shout at whoever gives you the XML until they fix their output. In the meantime, you can make your parsing script die a little more gracefully by using `eval` like this:

my $file = 'file.rss';
my $p = XML::RSS->new;
eval { $p->parsefile($file) };
if ($@) {
    die "Bad XML document!!\n";
} else {
    print "Good XML!\n";
}

-- <http://www.dave.org.uk> "Perl makes the fun jobs fun and the boring jobs bearable" - me
Re: Re: Parsing badly formed RSS or XML by tomhukins (Curate) on Feb 22, 2001 at 21:36 UTC
I realise that XML parsers are supposed to reject badly formed XML, but in my limited experience, a sizeable proportion of RSS feeds are badly deployed. I have alerted several Webmasters to problems I've encountered, but problems aren't always fixed. Until recently, I used code very similar to what you have above. However, I frequently found myself missing information from badly formed RSS feeds. I can understand the benefits of rejecting badly formed XML in mission-critical situations, but for RSS feeds I'd rather misinterpret the information I'm receiving than receive no information at all. Others' opinions may differ. With hindsight, I should have written a strong disclaimer with my code that it breaks the XML spec.
Re: Re: Re: Parsing badly formed RSS or XML by davorg (Chancellor) on Feb 22, 2001 at 21:43 UTC
"I have alerted several Webmasters to problems I've encountered, but problems aren't always fixed."

I wonder if you've considered a message on your page along the lines of "we would have liked to have been able to give you information from (name of website), but unfortunately their data feed that claims to be XML isn't and therefore it breaks well-behaved parsers". If we let people get away with producing bad XML, then we're heading down a path that leads to the same sort of nightmare that we currently have with HTML. -- <http://www.dave.org.uk> "Perl makes the fun jobs fun and the boring jobs bearable" - me
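One way such a notice might be wired into a page generator is sketched below. The helper name `feed_section`, the HTML layout and the wording of the notice are all assumptions for illustration, and the parser is passed in as a code reference so the sketch does not depend on any particular module.

```perl
use strict;
use warnings;

# Hypothetical helper: build the page section for one feed. $parse is any
# code ref that dies on malformed XML and otherwise returns item titles.
sub feed_section {
    my ($site, $content, $parse) = @_;
    my @titles = eval { $parse->($content) };
    if ($@) {
        # Parsing failed: tell the reader why this site is missing.
        return "<p>No headlines from $site because its feed is not well-formed XML.</p>\n";
    }
    return "<h2>$site</h2>\n<ul>\n"
         . join('', map { "<li>$_</li>\n" } @titles)
         . "</ul>\n";
}

# Stand-in parser for the demonstration only: it rejects everything.
print feed_section('example.org', '<rss>&</rss>', sub { die "not well-formed\n" });
```

With XML::RSS, the code reference could plausibly be `sub { my $rss = XML::RSS->new; $rss->parse($_[0]); map { $_->{title} } @{ $rss->{items} } }`. A real page would also need to HTML-escape the titles, which this sketch leaves out.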
Re: Re: Re: Re: Parsing badly formed RSS or XML by tomhukins (Curate) on Feb 22, 2001 at 22:08 UTC
Re: Parsing badly formed RSS or XML by mirod (Canon) on Feb 22, 2001 at 21:30 UTC
Actually you don't have to escape `>`, only `<`. If you replace every `&` by `&amp;` you will lose all the special characters that are already written as entities, so you probably want to use an entity table from the W3C, such as these ones. In this case you don't have to replace `&` anymore (unless it's followed by a space of course). You would just have to add the entity declarations at the top of the RSS file. You would probably get Unicode characters though, which might break the rest of your processing, YMMV.
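A minimal sketch of adding entity declarations at the top of a feed, assuming the feed has no DOCTYPE of its own. The helper name `add_entity_declarations` is hypothetical, and the three entities in `%entity` are only a tiny sample of what a full W3C table would contain.

```perl
use strict;
use warnings;

# A tiny illustrative subset of the HTML Latin-1 entities (name => code point).
my %entity = (
    nbsp   => 160,
    copy   => 169,
    eacute => 233,
);

# Hypothetical helper: add an internal DTD subset declaring the entities,
# placed after the XML declaration if there is one.
sub add_entity_declarations {
    my ($xml, $root) = @_;
    return $xml if $xml =~ /<!DOCTYPE/;    # assumption: leave an existing DOCTYPE alone
    my $decls = join '',
        map { sprintf qq{  <!ENTITY %s "&#%d;">\n}, $_, $entity{$_} }
        sort keys %entity;
    my $doctype = '<!DOCTYPE ' . $root . " [\n" . $decls . "]>\n";
    # Insert after the XML declaration (and any whitespace following it),
    # or at the very start when there is no declaration.
    $xml =~ s{\A(<\?xml[^>]*\?>\s*)?}{ (defined $1 ? $1 : '') . $doctype }e;
    return $xml;
}

my $feed = qq{<?xml version="1.0"?>\n<rss version="0.91"><channel>}
         . qq{<title>Caf&eacute;</title></channel></rss>\n};
print add_entity_declarations($feed, 'rss');
```

Declaring the entities as numeric character references means the parser hands back Unicode characters, which is the side effect mentioned above.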