in reply to Broken headlines

Try using HTML::Parser like I did at Re: cblast35 to avoid choking XML::Parser, as this is not likely to get resolved any time soon.

MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
** The third rule of perl club is a statement of fact: pod is sexy.

Re: Re: Broken headlines
by Juerd (Abbot) on Oct 01, 2003 at 21:40 UTC

    Try using HTML::Parser

    That would mean patching XML::RSS. I'm currently just filtering perlmonks' data to try to make it valid. With success, so far.

    PerlMonks--. If you want people to use Perl to parse your headlines, then don't make it look like XML! Just colon separated fields would do a much better job.

    Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

      I'd be happy to take down the broken RSS feed or refund your purchase price if it bothers you that much. Or feel free to just not use it.

      [ I didn't write it. There is no infrastructure in place for pmdev to even look at the code. I thank whoever did write it for donating the effort, even if they didn't manage to get it perfect.

      If you looked around a bit you'd probably notice that lots of the XML feeds used to have similar problems and such could be resolved by adding a character encoding header.
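To illustrate the kind of fix meant by "adding a character encoding header", here is a minimal sketch. The helper name declare_encoding is hypothetical, and ISO-8859-1 is only an assumed encoding for the example; the real feed code is not shown in this thread. An XML parser assumes UTF-8 when no encoding is declared, so Latin-1 bytes above 0x7F make it choke unless the declaration names the encoding.

```perl
use strict;
use warnings;

# Hypothetical helper: make sure an XML document declares its encoding.
# $encoding is an assumption supplied by the caller (e.g. 'ISO-8859-1').
sub declare_encoding {
    my ($xml, $encoding) = @_;

    if ($xml =~ /\A<\?xml\b([^?]*)\?>/) {
        my $attrs = $1;

        # An encoding is already declared: leave the document alone.
        return $xml if $attrs =~ /\bencoding\s*=/;

        # Insert the encoding right after the version pseudo-attribute.
        $xml =~ s/\A(<\?xml\s+version\s*=\s*(?:"[^"]*"|'[^']*'))/$1 encoding="$encoding"/;
        return $xml;
    }

    # No XML declaration at all: prepend one.
    return qq{<?xml version="1.0" encoding="$encoding"?>\n$xml};
}

print declare_encoding(qq{<?xml version="1.0"?>\n<rss/>}, 'ISO-8859-1'), "\n";
```

This only helps when the bytes really are in the declared encoding; it does nothing about unescaped markup characters or control characters.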

      The whole site is volunteers. I thank those who volunteer answers and discussions. It is a "gift" culture and I donate to it because it is often fun and I often enjoy helping people or producing something of some value.

      Someone on pmdev could probably just write code to produce similar but better output based on viewing the current RSS output (and perhaps how the new XML feeds are done, though they don't handle control characters properly -- which are only sent by non-conforming clients and may already be filtered from node titles, but this should, of course, still be fixed). I'd be grateful if someone cared to volunteer to do that (but I certainly don't feel I'm due such a contribution and understand some of the frustration of trying to do that). Such would certainly motivate me to raise the priority on trying out such code and trying to replace the broken feed.

      Frankly, ranting is anything but motivating, for me. ]

                      - tye

        /me is not an XML expert, but so far, s/([^\x20\x21\x23-\x25\x28-\x3b\x3d\x3f-\x7e])/sprintf "&#%d;", ord $1/ge has been very helpful for me. (Although most of the time I am too lazy to look up which characters are valid, and just use s/(\W)/sprintf "&#%d;", ord $1/ge; (And even more often, I am too lazy to even write XML myself, and use a module for that))
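As a rough sketch of how that first substitution behaves, here it is wrapped in a sub (the name escape_xml_text is made up for this example). It turns every character outside a conservative safe set into a decimal character reference, so the quote characters, &, < and > all get escaped.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution quoted above: every character
# that is not in the "safe" printable-ASCII set becomes &#NNN;.
sub escape_xml_text {
    my ($text) = @_;
    $text =~ s/([^\x20\x21\x23-\x25\x28-\x3b\x3d\x3f-\x7e])/sprintf "&#%d;", ord $1/ge;
    return $text;
}

print escape_xml_text(q{Tom & Jerry's <"guide">}), "\n";
# prints: Tom &#38; Jerry&#39;s &#60;&#34;guide&#34;&#62;
```

Two caveats. A control character such as \x01 would come out as &#1;, which XML 1.0 still does not allow, so those need to be stripped separately. And the text should already be decoded into characters; run on raw UTF-8 bytes, this would emit one reference per byte.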

        I also thank whoever made headlines.rdf, because it is a wonderful idea. But that does not mean I like how it works (doesn't work), and even wonderful ideas need to be implemented correctly.

        On PerlMonks, when reporting a bug, you get the strangest answers. I thought the open source community got ruder by the day, but apparently the monster that is called "WONTFIX" or "Patches welcome" has affected the closed source community as well.

        The common advice is "Don't use regexes to parse XML, use an XML parser". Especially in this very monastery, this is said a lot. But when the XML is broken, of course (?), instead of asking for suggestions or maybe even rudely mentioning that patches are welcome, the people here document that it is broken and suggest parsing XML *without* a normal XML parser!

        This is a very well known development strategy:

        1. Something breaks
        2. The broken behaviour is documented
        3. Everyone who expects things to work is wrong. After all, the bug is documented and therefore a feature.
        4. Documented behaviour never needs to be corrected

        I disagree. If you don't know how to fix it, there are places where you can ask for help. Actually, that place is here, in our very own Seekers of Perl Wisdom. But I really cannot believe that whoever made this feed doesn't know how to fix it.

        But yes, if the powers that be are unwilling to make the RDF be XML, it should indeed be removed and replaced by something that doesn't fool people into believing that it is XML.

        In my opinion, a fix is appropriate, and not at all hard. Please tell the people that have access to the code.

        I accept your offer to refund the purchase price. Thank you very much.

        Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

        P.S. Yes, XML is scary. I also like to avoid it. But when there is a standard, and you choose to use that standard, make sure you are compliant. If you don't want to use the standard, don't use its syntax.