DOS coding compatibility with perl "é"->"'" causes http://www.perlmonks.org/headlines.rdf to be invalid:

not well-formed (invalid token) at line 23, column 44, byte 622 at /usr/lib/perl5/XML/Parser.pm line 185
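
The byte offset in that message (622 here) should point at or near the first byte the parser could not accept. As a rough sketch, assuming the feed carries no encoding declaration and is therefore read as UTF-8, the core Encode module can locate such a byte. The helper name first_invalid_utf8_offset and the sample string are made up for illustration; they are not taken from the feed.

```perl
use strict;
use warnings;
use Encode qw(decode FB_QUIET);

# Hypothetical helper: return the byte offset of the first byte that is
# not valid UTF-8, or -1 if the whole string decodes cleanly.
sub first_invalid_utf8_offset {
    my ($bytes) = @_;
    my $copy = $bytes;                  # decode() with a CHECK flag consumes its argument
    decode('UTF-8', $copy, FB_QUIET);   # stops quietly at the first malformed sequence
    # Whatever is left in $copy is the part that could not be decoded.
    return length($copy) ? length($bytes) - length($copy) : -1;
}

# A lone Latin-1 "é" (byte 0xE9) in otherwise plain ASCII markup.
print first_invalid_utf8_offset("<title>caf\xE9</title>"), "\n";   # prints 10
```

An offset of -1 would mean the bytes are valid UTF-8 and the problem lies elsewhere.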

Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

Re: Broken headlines
by Aristotle (Chancellor) on Oct 01, 2003 at 14:09 UTC

      An RDF feed for the Monastery. It's a little broken, but it should be easy enough for you to parse with perl.

      I am using Perl. Specifically with XML::RSS. Besides, broken XML is not XML. This site doesn't use XML, it uses something that happens to look like it.

      "Don't parse XML with an XML parser, use regexes!". I guess it must be very hard to generate correct XML. After all -- and XML barbie concurs -- XML is *hard*!

      I will use an extra Perl script. Not to parse the XML, because that would be extremely silly. But to try to make valid XML from the string I get.
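
Such a pre-filter might look roughly like the following. This is only a sketch of one possible approach, not the script actually used: it assumes that input which is not valid UTF-8 is ISO-8859-1 throughout, and the helper name to_utf8_bytes is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK);

# Hypothetical helper: return the feed as valid UTF-8 bytes.
# Assumption: input that fails strict UTF-8 decoding is ISO-8859-1.
sub to_utf8_bytes {
    my ($bytes) = @_;
    my $copy = $bytes;   # decode() with a CHECK flag may modify its argument
    my $ok = eval { decode('UTF-8', $copy, FB_CROAK); 1 };
    return $bytes if $ok;   # already valid UTF-8, pass through untouched
    # Reinterpret every byte as Latin-1 and re-encode the result as UTF-8.
    return encode('UTF-8', decode('ISO-8859-1', $bytes));
}

# Usage as a filter: read the raw feed on STDIN, write repaired bytes.
# binmode(STDIN); binmode(STDOUT);
# my $raw = do { local $/; <STDIN> };
# print to_utf8_bytes($raw);
```

The check is all-or-nothing: a document that mixes valid UTF-8 sequences with stray Latin-1 bytes would be treated as Latin-1 as a whole, so this only suits feeds that are consistently one or the other.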

      Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

        Would you consider sharing the script you're using to re-format the XML?
Re: Broken headlines
by PodMaster (Abbot) on Oct 01, 2003 at 20:08 UTC
    Try using HTML::Parser like I did at Re: cblast35 to avoid choking XML::Parser, as this is not likely to get resolved any time soon.

    MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
    I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
    ** The third rule of perl club is a statement of fact: pod is sexy.

      Try using HTML::Parser

      That would mean patching XML::RSS. I'm currently just filtering perlmonks' data to try to make it valid. With success, so far.

      PerlMonks--. If you want people to use Perl to parse your headlines, then don't make it look like XML! Just colon-separated fields would do a much better job.

      Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

        I'd be happy to take down the broken RSS feed or refund your purchase price if it bothers you that much. Or feel free to just not use it.

        [ I didn't write it. There is no infrastructure in place for pmdev to even look at the code. I thank whoever did write it for donating the effort, even if they didn't manage to get it perfect.

        If you looked around a bit you'd probably notice that lots of the XML feeds used to have similar problems, and that such problems could be resolved by adding a character encoding header.
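
For illustration, adding such a header on the consuming side could be sketched as below. The assumptions are that the bytes really are ISO-8859-1 and that the feed either lacks an XML declaration or has one without an encoding; the helper name declare_encoding is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: make sure the document starts with an XML
# declaration that names an encoding (ISO-8859-1 unless told otherwise).
sub declare_encoding {
    my ($xml, $encoding) = @_;
    $encoding = 'ISO-8859-1' unless defined $encoding;

    # A declaration that already names an encoding is left untouched.
    return $xml if $xml =~ /\A<\?xml[^>]*\bencoding\s*=/;

    # A declaration without an encoding gets one right after the version.
    return $xml
        if $xml =~ s/\A(<\?xml\s+version\s*=\s*(?:"[^"]*"|'[^']*'))/$1 encoding="$encoding"/;

    # No declaration at all: prepend one.
    return qq{<?xml version="1.0" encoding="$encoding"?>\n} . $xml;
}
```

This changes no content bytes; it only tells the parser how to read them, so it helps only if the declared encoding matches what the feed actually contains.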

        The whole site is volunteers. I thank those who volunteer answers and discussions. It is a "gift" culture and I donate to it because it is often fun and I often enjoy helping people or producing something of some value.

        Someone on pmdev could probably just write code to produce similar but better output based on viewing the current RSS output (and perhaps how the new XML feeds are done, though they don't handle control characters properly -- which are only sent by non-conforming clients and may already be filtered from node titles, but this should, of course, still be fixed). I'd be grateful if someone cared to volunteer to do that (but I certainly don't feel I'm due such a contribution and understand some of the frustration of trying to do that). Such would certainly motivate me to raise the priority on trying out such code and trying to replace the broken feed.
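
As a small sketch of the control-character part: XML 1.0 allows only tab, line feed and carriage return below 0x20, so the remaining C0 control characters could be dropped before the text goes into a feed. The helper name strip_xml_control_chars is hypothetical, and this is not how the site's code handles it.

```perl
use strict;
use warnings;

# Hypothetical helper: delete the C0 control characters that XML 1.0 forbids.
# Tab (0x09), line feed (0x0A) and carriage return (0x0D) are allowed and kept.
sub strip_xml_control_chars {
    my ($text) = @_;
    $text =~ tr/\x00-\x08\x0B\x0C\x0E-\x1F//d;
    return $text;
}
```

Because these byte values never occur inside a multi-byte UTF-8 sequence, the same filter is safe on raw UTF-8 or Latin-1 bytes as well as on decoded text.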

        Frankly, ranting is anything but motivating, for me. ]

                        - tye