The User nodes info XML generator still appears to be crippled, as mentioned by grinder a while back.

In addition to ampersands (&), single quotes (') are breaking our friendly neighborhood XML parsers, including XML::Twig, which is based on XML::Parser, which is in turn based on the expat library.

According to my O'Reilly XML book, there are five entity references predefined in XML:

&lt;    The less-than sign, or opening angle bracket (<)
&amp;   The ampersand (&)
&gt;    The greater-than sign, or closing angle bracket (>)
&quot;  The straight, double quotation mark (")
&apos;  The apostrophe, or single quote (')

Of these five, only &lt; and &amp; must be used in place of the literal characters in element content, whereas the other references are optional unless there are explicit conflicts within attribute values.

So really this sounds like a case of the expat library being overzealous. Does anyone know how to wrangle XML::Parser into wrangling expat into being more forgiving?

If not, then why can't these five entity references be properly encoded in the XML user info generator?
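
A minimal sketch of what encoding all five references could look like, written in Python for illustration rather than in whatever the generator actually uses. `escape_all` and `QUOTES` are names made up for this example; `xml.sax.saxutils.escape` is the standard-library helper that handles &, < and > and accepts extra replacements:

```python
from xml.sax.saxutils import escape

# The two quote characters are optional in element content, but they are
# needed inside attribute values delimited by the same quote character.
QUOTES = {'"': "&quot;", "'": "&apos;"}

def escape_all(text: str) -> str:
    """Replace all five predefined XML characters with entity references."""
    # escape() rewrites & first, then > and <, then applies QUOTES,
    # so the references it produces are never escaped a second time.
    return escape(text, QUOTES)

print(escape_all("O'Reilly & \"friends\" <rock>"))
# prints: O&apos;Reilly &amp; &quot;friends&quot; &lt;rock&gt;
```

Escaping the quotes everywhere is more than element content strictly requires, but it keeps one routine safe for both element content and attribute values.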

Thanks,
Matt

Re: XML User Info Status?
by mirod (Canon) on May 08, 2002 at 08:48 UTC

    That's weird, the XML seems to parse OK for me. Could you provide an example of data that trips the parser? As it is I really can't see how a ' could cause any problem.

    For the record, I doubt very much that expat is the problem. It was written by James Clark, who was one of the main authors of XML and who is quite famous for the quality of the tools he writes (groff, nsgmls, jade, expat). Actually, expat is probably _the_ XML reference parser. If expat says a document is not well-formed, then you can assume it is not.
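
    A quick way to check this claim against expat itself, sketched in Python because its standard library ships an expat binding (XML::Parser should behave the same way, since it wraps the same library). The `<user>` element and the `is_well_formed` helper are invented for the example:

```python
from xml.parsers import expat

def is_well_formed(document: bytes) -> bool:
    """Return True if expat parses the whole document without error."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(document, True)  # True marks this as the final chunk
        return True
    except expat.ExpatError:
        return False

print(is_well_formed(b"<user>O'Reilly</user>"))         # True: literal ' is fine
print(is_well_formed(b"<user>Tom & Jerry</user>"))      # False: bare ampersand
print(is_well_formed(b"<user>Tom &amp; Jerry</user>"))  # True: escaped ampersand
```

    A literal apostrophe in element content goes through untouched, while an unescaped ampersand is rejected.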

Re: XML User Info Status?
by mojotoad (Monsignor) on May 08, 2002 at 09:07 UTC
    Oh boy. Okay, so humans are not necessarily the best entity spotters. The original problem was not with the single quote (') but with an acute accent (´), which is commonly used as a balanced closing single quote or a stylish apostrophe.

    In short, the character causing problems (besides myself) is an external entity.

    Now I'm wondering how the User nodes info xml generator can compensate, via a DTD, UTF-8 encoding or somesuch, for external entities that are legal HTML on PM.
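
    One possible way to compensate without touching the encoding header at all is to emit every non-ASCII character as a numeric character reference, so the feed stays pure ASCII. This is only a sketch in Python, not the generator's code; `to_ascii_xml` and the `<user>` element are made-up names, and the `xmlcharrefreplace` error handler is a standard Python codec feature:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def to_ascii_xml(text: str) -> bytes:
    """Escape markup characters, then turn non-ASCII into &#NNN; references."""
    # Escape & and < first; the references added by the encoder come
    # afterwards, so their ampersands are left alone.
    return escape(text).encode("ascii", errors="xmlcharrefreplace")

payload = to_ascii_xml("it\u00b4s")   # \u00b4 is the acute accent
print(payload)                        # b'it&#180;s'

# Any XML parser turns the reference back into the original character.
element = ET.fromstring(b"<user>" + payload + b"</user>")
print(element.text == "it\u00b4s")    # True
```

    The cost is a slightly larger feed; the benefit is that the result parses the same way whatever encoding the consumer assumes.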

    Matt

      All of the "old" XML feeds at PM (currently includes all except chatterbox xml ticker) suffer from 2 problems. First, they are missing a header declaring the character set, which means that "unusual" characters [1] cause problems. Second, some strange interaction causes XML::Generator to sometimes not escape characters that it should.

      Redoing all of the XML tickers to fix these two problems is planned. I've also heard plans to create new (additional) XML tickers that would rethink the whole layout, tag names, etc.

      [1] Most characters outside of [\s -~], but I'm not sure if it is just [\x80-\xff] or if other characters also cause problems.
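
      The missing-header problem can be reproduced in a few lines. The sketch below uses Python's expat binding and assumes the feed body is Latin-1 (ISO-8859-1) bytes, which is a guess on my part and not something confirmed here; the `<user>` element and the `parses` helper are invented for the example:

```python
from xml.parsers import expat

def parses(document: bytes) -> bool:
    """Return True if expat accepts the document as well-formed."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(document, True)
        return True
    except expat.ExpatError:
        return False

# In Latin-1 the acute accent is the single byte 0xB4.
body = "<user>it\u00b4s</user>".encode("latin-1")

# No declaration: expat assumes UTF-8, where a lone 0xB4 is invalid.
print(parses(body))                                                    # False

# With the character set declared, the same bytes are accepted.
print(parses(b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body))   # True
```

      The bytes of the body are identical in both calls; only the declaration changes the outcome.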

              - tye (but my friends call me "Tye")