in reply to XML::RSS::Parser::Lite Question

The HTML is likely entity-encoded. Have a look at HTML::Entities. Also, you should be aware that malicious HTML could be injected into your page from such a feed if you're not careful. My advice is to let only "safe" HTML tags through, like <p>, <b>, <i>. I wouldn't even embed images, as that implies an HTTP request from the client viewing your aggregate to a potentially unsafe server.

Re^2: XML::RSS::Parser::Lite Question
by Utilitarian (Vicar) on Nov 18, 2009 at 10:53 UTC
    Corion is correct on all points above. Part of the XML spec is that markup in any text between tags, e.g.
    <description><b>Best post ever: </b>This is a super hoopy post froods</description>
    must be escaped to make it XML-safe, i.e.
    <description>&lt;b&gt;Best post ever: &lt;/b&gt;This is a super hoopy post froods</description>
    This prevents confusion when using XPath tools, which would otherwise see <b> as a child element of <description>.
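
A minimal sketch of that escaping round trip, assuming HTML::Entities is installed. The variable names are illustrative and not taken from the original script; the second argument to encode_entities limits escaping to the three characters that matter for XML text.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities decode_entities);

# The HTML a feed author wants to put inside <description>.
my $html = '<b>Best post ever: </b>This is a super hoopy post froods';

# Escape only <, > and & so the text is safe inside an XML element.
my $escaped = encode_entities($html, '<>&');
print "$escaped\n";

# Decoding restores the original markup (which is NOT yet safe to display).
my $decoded = decode_entities($escaped);
print "$decoded\n";
```

The decoded string is exactly what the feed author wrote, so it still needs to be treated as untrusted input.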

    On security, if your users are loading remote data from a session on your service, be very sure that:

    • No JavaScript injection is possible
    • You are not revealing session info (HTTP_REFERER)
    Do not blindly decode the entities back to HTML, as this may result in malicious code executing in your users' browsers while they are logged into your service.
    The best way of preventing XSS is whitelisting HTML tags and the allowed attributes for each tag
        (consider <b onmouseover="doEvil();">Some text</b> when allowing specific tags). Have a look at HTML::Scrubber.
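
A rough sketch of such a whitelist, assuming HTML::Scrubber is installed. The tag list and the rule for <a> are assumptions chosen for illustration, not a recommended policy; tags passed via allow keep no attributes unless a rule says otherwise.

```perl
use strict;
use warnings;
use HTML::Scrubber;

# Whitelist: only these tags survive; any other tag is stripped.
my $scrubber = HTML::Scrubber->new(
    allow => [qw(p b i)],
    rules => [
        # Hypothetical rule: links may keep an http(s) href and nothing else.
        a => { href => qr{^https?://}i, '*' => 0 },
    ],
);

my $dirty = '<b onmouseover="doEvil();">Some text</b> '
          . '<a href="http://example.com/" onclick="doEvil();">link</a>';

# Event-handler attributes are dropped; the allowed tags remain.
my $clean = $scrubber->scrub($dirty);
print "$clean\n";
```

Whitelisting attributes per tag matters as much as whitelisting the tags themselves, since an allowed tag can otherwise carry a script in an event handler.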

    The best way of retrieving remote images without revealing session info is to ensure all such info travels in request headers (cookies) or in a POST body rather than in the URL.

    Edit: Another thing about remote images I'd forgotten to mention: some browsers do content sniffing and ignore the alleged nature of the content. Interesting article on the dangers of content sniffing and how to handle

      Thank you for these warnings!

      I'm not sure of all the implications of the security issues you mention, but since the above code contained links which would send an HTTP_REFERER, I disabled the script.

      The bulk of my actual intent was to use the RSS scripts on private servers, so only trusted content would be fed to the aggregator. But these issues with public content are good to know, as there's always the want for greater inclusion.

      If anyone wishes to further comment, then please feel free to do so. I'm still not sure of how the HTTP_REFERER could be traced, but looking into it now. Anyway, the script link above will not work - though the question is still held open for comment.

      Ty.

      BH

        Hi BlenderHead,
        The implication of HTTP_REFERER is that, where session info is present in the URL of the page, that info will be sent in the HTTP_REFERER header of any request the page triggers. This can be abused to capture the currently running session of a poorly guarded service.

        The issue with your script's output is the entity encoding of data within XML files, in this case the item description. To resolve it, use HTML::Entities to decode the encoded characters; however, you should then run HTML::Scrubber on the resulting output if you do not trust the originating source.
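
Those two steps might be combined roughly as follows. This is only a sketch: safe_description is a hypothetical helper name, the sample input is made up, and the allowed tag list is an assumption you would adapt to your own policy.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);
use HTML::Scrubber;

# One scrubber, reused for every item; only plain formatting tags pass.
my $scrubber = HTML::Scrubber->new( allow => [qw(p b i)] );

# Hypothetical helper: decode the feed's entities, then whitelist the result.
sub safe_description {
    my ($encoded) = @_;
    my $html = decode_entities($encoded);   # back to raw, untrusted HTML
    return $scrubber->scrub($html);         # strip everything not whitelisted
}

my $item = '&lt;p&gt;Hello &lt;b onclick="evil()"&gt;world&lt;/b&gt;&lt;/p&gt;';
my $safe = safe_description($item);
print "$safe\n";
```

Decoding first and scrubbing second is the important part of the ordering: scrubbing the still-encoded text would find no tags to remove.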

        And Corion's advice about not displaying images from remote/untrusted sources is wise until you have developed a security policy for this scenario. Happy reading ;)