in reply to Re: Broken headlines
in thread Broken headlines

Try using HTML::Parse

That would mean patching XML::RSS. I'm currently just filtering perlmonks' data to try to make it valid. With success, so far.

PerlMonks--. If you want people to use Perl to parse your headlines, then don't make it look like XML! Plain colon-separated fields would do a much better job.

Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

Re^3: Broken headlines (gift)
by tye (Sage) on Oct 02, 2003 at 06:10 UTC

    I'd be happy to take down the broken RSS feed or refund your purchase price if it bothers you that much. Or feel free to just not use it.

    [ I didn't write it. There is no infrastructure in place for pmdev to even look at the code. I thank whoever did write it for donating the effort, even if they didn't manage to get it perfect.

    If you looked around a bit you'd probably notice that lots of the XML feeds used to have similar problems, and that such problems could be resolved by adding a character encoding header.
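
    As an illustration of what such a header looks like, here is a minimal Perl sketch that prepends an XML declaration naming an encoding when a document lacks one. The helper name and the ISO-8859-1 default are assumptions made for the example, not a description of the site's code.

```perl
use strict;
use warnings;

# Hypothetical helper; the default encoding is an assumption. Use whichever
# encoding the feed's bytes are actually in.
sub add_xml_declaration {
    my ($xml, $encoding) = @_;
    $encoding = 'ISO-8859-1' unless defined $encoding;

    # Leave documents that already start with a declaration untouched.
    return $xml if $xml =~ /\A<\?xml\s/;

    return qq{<?xml version="1.0" encoding="$encoding"?>\n} . $xml;
}

print add_xml_declaration("<rdf:RDF></rdf:RDF>\n");
```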

    The whole site is volunteers. I thank those who volunteer answers and discussions. It is a "gift" culture and I donate to it because it is often fun and I often enjoy helping people or producing something of some value.

    Someone on pmdev could probably just write code to produce similar but better output based on viewing the current RSS output (and perhaps how the new XML feeds are done, though they don't handle control characters properly -- which are only sent by non-conforming clients and may already be filtered from node titles, but this should, of course, still be fixed). I'd be grateful if someone cared to volunteer to do that (but I certainly don't feel I'm due such a contribution and understand some of the frustration of trying to do that). Such would certainly motivate me to raise the priority on trying out such code and trying to replace the broken feed.

    Frankly, ranting is anything but motivating, for me. ]

                    - tye

      /me is not an XML expert, but so far, s/([^\x20\x21\x23-\x25\x28-\x3b\x3d\x3f-\x7e])/sprintf "&#%d;", ord $1/ge has been very helpful for me. (Although most of the time I am too lazy to look up which characters are valid, and just use s/(\W)/sprintf "&#%d;", ord $1/ge; (And even more often, I am too lazy to even write XML myself, and use a module for that))
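
      For readers who want to try the first substitution, a minimal sketch wraps it in a function. The function name is hypothetical, and the sketch assumes the input is a decoded character string rather than raw UTF-8 bytes, since ord would otherwise report byte values instead of code points. It also adds one step that is not in the substitution above: XML 1.0 does not allow most control characters below 0x20 even as numeric references, so they are dropped first.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is illustrative, not part of XML::RSS).
# Assumes $text is a decoded character string, not raw UTF-8 bytes.
sub escape_xml_text {
    my ($text) = @_;

    # Added assumption: remove control characters that XML 1.0 forbids.
    # Tab, LF and CR are allowed, so they pass through to the next step.
    $text =~ s/[\x00-\x08\x0b\x0c\x0e-\x1f]//g;

    # The substitution from the post: every character outside the safe
    # printable ASCII set becomes a decimal character reference.
    $text =~ s/([^\x20\x21\x23-\x25\x28-\x3b\x3d\x3f-\x7e])/sprintf "&#%d;", ord $1/ge;

    return $text;
}

print escape_xml_text(q{Tom & "Jerry" <1>}), "\n";
# prints: Tom &#38; &#34;Jerry&#34; &#60;1&#62;
```

      Numeric references avoid depending on named entities, at the cost of less readable output.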

      I also thank whoever made headlines.rdf, because it is a wonderful idea. But that does not mean I like how it works (doesn't work), and even wonderful ideas need to be implemented correctly.

      On PerlMonks, when reporting a bug, you get the strangest answers. I thought the open source community got ruder by the day, but apparently the monster that is called "WONTFIX" or "Patches welcome" has affected the closed source community as well.

      The common advice is "Don't use regexes to parse XML, use an XML parser". Especially in this very monastery, this is said a lot. But when the XML is broken, of course (?), instead of asking for suggestions or maybe even rudely mentioning that patches are welcome, the people here document that it is broken and suggest parsing XML *without* a normal XML parser!

      This is a very well known development strategy:

      1. Something breaks
      2. The broken behaviour is documented
      3. Everyone who expects things to work is wrong. After all, the bug is documented and therefore a feature.
      4. Documented behaviour never needs to be corrected

      I disagree. If you don't know how to fix it, there are places where you can ask for help. Actually, that place is here, in our very own Seekers of Perl Wisdom. But I really cannot believe that whoever made this feed doesn't know how to fix it.

      But yes, if the powers that be are unwilling to make the RDF be XML, it should indeed be removed and replaced by something that doesn't fool people into believing that it is XML.

      In my opinion, a fix is appropriate, and not at all hard. Please tell the people that have access to the code.

      I accept your offer to refund the purchase price. Thank you very much.

      Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

      P.S. Yes, XML is scary. I also like to avoid it. But when there is a standard, and you choose to use that standard, make sure you are compliant. If you don't want to use the standard, don't use its syntax.

        Well, you got my point completely wrong. I never said the problem shouldn't be fixed nor even that it wouldn't be fixed. I certainly didn't say anything about not knowing how to fix it. It isn't particularly hard to fix it (I even outlined how several people could go about fixing it). I'd try explaining again but I doubt I'd have better luck.

                        - tye

        Documented behaviour never needs to be corrected
        I disagree, at least partly. Most modules on CPAN have a "BUGS" section in their POD for a reason - it doesn't mean it's never going to be corrected, it just means the bug is known. Depending on whether it is feasible to fix (and this one certainly is), how hard it is to do, and what priorities there are, it will be fixed sooner or later. In the meantime, users should be aware of the problem.

        Makeshifts last the longest.