I have started to play with RSS/RDF feeds. I read a few articles on the topic, and found that RSS is very easy to use. Grab the file via HTTP (HTTP::GHTTP or LWP), use the XML::RSS parser to convert it to a consistent format, and then use XML::LibXSLT to transform that into a piece of usable HTML.

I've come up against a few problems (e.g. why is it so slow from CGI, but not from the command line?), the most annoying of which is that some people provide RSS feeds that are not valid XML. The commonest problems I've seen are that they use entities without including the entity definitions, and that ampersands are left unescaped.

As with any good XML application, this causes the system to fail with no usable output.

Without wishing to encourage sloppy XML, as some companies encouraged sloppy HTML, what methods are available to wash the file before passing it to XML::RSS or XML::LibXSLT?

I've considered using a RegEx, but I know that's not a place I wish to go. Though it could be a way of locating an isolated & and replacing it with &amp;.
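To make the RegEx idea concrete, here is a minimal sketch of what I have in mind. The sub name escape_bare_ampersands is just something I made up for illustration, and the pattern is my own guess at what counts as "isolated": any & that is not followed by something shaped like an entity or character reference.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): escape every '&' that does
# not start something shaped like a reference: &name; &#123; or &#x1F;
sub escape_bare_ampersands {
    my ($xml) = @_;
    # The negative lookahead leaves existing references untouched.
    $xml =~ s/&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/&amp;/g;
    return $xml;
}

print escape_bare_ampersands('<title>Fish & Chips &amp; Peas, AT&T</title>'), "\n";
```

This only deals with the bare ampersands. A reference such as &nbsp; is left alone, so it would still fail without an entity definition, and the substitution would also wrongly rewrite an & inside a CDATA section. That is part of why I'm wary of going down the RegEx route.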

Yesterday I saw Matts mention xmllint on his use Perl; column. I don't know anything about it, or whether there is a Perl interface for it that works on Windows. I see there is also XML::Clean, and I could always try Tidy in XML mode.

Questions:

  1. Am I being unwise in accepting sloppy XML? Is this the thin end of the wedge?
  2. Should I accept the sloppy XML and wash it, given that I'm too small to get big companies to clean up their act?
  3. If I am to clean up the XML, what Perl ways of doing this are there?

Humble thanks in advance...
