cormanaz has asked for the wisdom of the Perl Monks concerning the following question:

Good Day Monks. I am trying to parse an Atom feed using XML::TreeBuilder, and its parser method is blowing up on one particular line. I would include the whole thing here, but based on the preview results it does bad things to the PerlMonks page parser. So I include the first field, which is enclosed by an id tag and says:
tag:amazingracewiki.cbs.com,2007-04-03:/page/Race+Improvements/thread/580644/Canadian+Contestants+n&%2339;+More+heart+pumping+suspense.?offset=0&maxResults=20/reply1

The parser is complaining about column 147, where the sequence &%2339; begins. I guess the parser is treating this as an invalid escape sequence, but it's part of a URL so who knows what it really is.

Question is, how can I deal with this? I don't want to throw away the whole feed because of this one squirrely entry. Is there some way to get the parser to throw out this one entry rather than DIEing?

Thx...

Steve

Re: XML::TreeBuilder invalid token problem
by Joost (Canon) on Apr 07, 2007 at 14:41 UTC
    "n&%2339" is not valid XML. An ampersand can only be used to refer to entities, like &amp; and &gt;, and then only a few (non-numeric) entities are predefined.

    Update: I'm not sure if there is a way to work around invalid XML using XML::TreeBuilder. Conforming XML parsers are required to throw an exception when encountering invalid XML. In other words, XML parsers should parse valid XML and reject invalid XML with no way of working around it.

    In your case, the URL should have been escaped using n&amp;%2339

Re: XML::TreeBuilder invalid token problem
by graff (Chancellor) on Apr 07, 2007 at 16:14 UTC
    I would interpret &%2339; as a failed attempt to encode the apostrophe character, as follows:
    1. apostrophe (' = 0x27) is 39 decimal, so &#39; is the (decimal) numeric character entity for that
    2. pound (# = 0x23) gets converted to %23, yielding &%2339;

    Since the result of step 2 is an invalid entity reference, either step 2 should not have been done (leaving &#39; as-is), or the remaining ampersand should have been converted as well, to yield %26%2339; (update: or perhaps it should have been rendered as &amp;%2339;)
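The two steps above can be replayed in a few lines of Perl. This is only a sketch of the guess: the sample title and the variable names are made up for illustration, and nothing here is known about how the feed generator actually works.

```perl
use strict;
use warnings;

# Hypothetical sample text; the real title in the feed is longer.
my $title = "n' More";

# Step 1: the apostrophe (decimal 39) becomes a numeric character reference.
(my $step1 = $title) =~ s/'/&#39;/g;

# Step 2: a URL-encoding pass turns '#' (0x23) into %23 but leaves '&' alone.
(my $step2 = $step1) =~ s/#/%23/g;

print "$step1\n";   # n&#39; More
print "$step2\n";   # n&%2339; More
```

The second line of output has the same shape as the sequence the parser rejects, which is consistent with the double-encoding guess but does not prove it.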

    The whole mess could have been avoided if the original apostrophe had been converted to %27, though I'm not sure from your description whether this would actually work either...

    Another update: As for actually dealing with that, maybe you want to "pre-condition" the text before passing it to XML::TreeBuilder -- e.g. if you have the whole xml string in a scalar called "$text", you could do this:

    # decode each %XX escape (two uppercase hex digits) back to its literal character
    $text =~ s/\%([0-9A-F]{2})/chr(hex($1))/eg;
    (or be more particular/ad-hoc, and just do s/\%23/\#/g;). Then pass $text to XML::TreeBuilder. That might put things right.
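A different pre-conditioning pass, along the lines of the &amp;%2339; rendering mentioned in the update above, is to leave the percent-escapes alone and instead escape any ampersand that does not begin a predefined entity or a numeric character reference. The sketch below is a swapped-in alternative, not the substitution shown above; the helper name escape_bare_ampersands and the shortened sample string are hypothetical. It assumes the feed declares no entities of its own and has no CDATA sections, both of which this blunt regex would mangle.

```perl
use strict;
use warnings;

# Hypothetical helper: turn every '&' into '&amp;' unless it already starts
# one of the five predefined entities or a numeric character reference.
sub escape_bare_ampersands {
    my ($xml) = @_;
    $xml =~ s/&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)/&amp;/g;
    return $xml;
}

# Shortened, made-up version of the offending id element.
my $id = '<id>Contestants+n&%2339;+More?offset=0&maxResults=20</id>';
print escape_bare_ampersands($id), "\n";
# <id>Contestants+n&amp;%2339;+More?offset=0&amp;maxResults=20</id>
```

The cleaned string could then be handed to the parser in place of the raw feed text. The id no longer matches what the publisher intended, but the rest of the feed survives.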
Re: XML::TreeBuilder invalid token problem
by roboticus (Chancellor) on Apr 07, 2007 at 14:43 UTC
    cormanaz:

    Just guesses, but:

  • Perhaps the text '%2339;' was originally '#39;'(1) but was incorrectly escaped (missing ';' after the first 3)?
  • Perhaps the '%2339;' represents a Unicode character, and the parser doesn't like it?

    ...roboticus

    (1) Assuming that 0x23 is '#', which I *think* it is, but not certain.

      The feed is UTF-8 and I decoded it before sending it to the parser, so hopefully it is not the latter.
Re: XML::TreeBuilder invalid token problem
by ikegami (Patriarch) on Apr 07, 2007 at 18:59 UTC
    To put it simply, the URI wasn't XML-escaped before being placed into the XML, resulting in invalid XML. & should have been transformed to &amp;.
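For illustration, this is roughly what the producer of the feed would have needed to do before writing the URI into the id element. It is a minimal sketch: xml_escape is a hypothetical hand-rolled helper and the example.com URI is made up. In practice an XML writer module would normally do this step for you.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special in XML text.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;    # must run first, or later escapes get re-escaped
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

# Made-up URI with a query string containing a raw ampersand.
my $uri = 'http://example.com/thread?offset=0&maxResults=20';
print '<id>', xml_escape($uri), "</id>\n";
# <id>http://example.com/thread?offset=0&amp;maxResults=20</id>
```

A conforming parser turns &amp; back into a plain & when it reads the element, so the consumer still sees the original URI.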
Re: XML::TreeBuilder invalid token problem
by john_oshea (Priest) on Apr 08, 2007 at 10:40 UTC

    In addition to the excellent answers you've already had, you might want to have a look at XML::Liberal. It has a number of 'remedies' for working around badly-formed XML, one of which (the 'EntityRef' one) looks like it might do the trick in this particular case. Not quite as good as getting the feed fixed, but it may be an option...

    Hope that helps