donfreenut has asked for the wisdom of the Perl Monks concerning the following question:


I'm parsing XML with the XML::Parser module.

The thing works fine, except when it encounters an ampersand in non-markup data (anything that's not a tag):

<writeup node_id="980117" reputation="0" createtime="2001-03-12 17:27:54">M&M McFlurry (thing)</writeup>

According to XML::Parser, this is not well-formed XML data, because of the & in M&M.

Is there any way I can get around this? Perhaps change ampersands to their HTML entity equivalent (&amp;) in my character event handler?

Here's my code:

sub parseUserSearchXML {
    my $XMLParser = new XML::Parser(Handlers => {Start => \&startHandler,
                                                 End   => \&endHandler,
                                                 Char  => \&charHandler});
    my $node;
    $XMLParser->parsefile($filename);
}

# event handler for XML::Parser - start tag event
sub startHandler {
    my ($expat, $tag, %attributes) = @_;
    $buffer = '';
    unless($tag =~ /$tags_to_ignore/o) {
        %temp = %attributes;
    }
}

# event handler for XML::Parser - non-markup event
sub charHandler {
    my ($expat, $string) = @_;
    $buffer .= $string;
}

# event handler for XML::Parser - end tag event
sub endHandler {
    my ($expat, $tag) = @_;
    unless($tag =~ /$tags_to_ignore/o) {
        $buffer =~ s/ \($crap_to_remove\)$//o;  # strip (person) (place) (thing) or (idea)
        $nodes{$buffer} = {%temp};
    }
    $buffer = '';
}

---
donfreenut

(ar0n) Re: An ampersand is not well-formed XML data?
by ar0n (Priest) on Apr 30, 2001 at 21:03 UTC
    Yup, just change it to &amp;. XML supports - I think - five named entities: &quot;, &apos;, &lt;, &gt; and &amp;. Note that you should also do this for all non-standard characters (i.e. above 127 in the ASCII char set).
    s/([^\x20-\x7e])/"&#".ord($1).";"/ge;
    Our Fearless Leader (tm) had the same problem with the xml nodes, and this is how he fixed it, IIRC.

    ar0n
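A minimal sketch of the numeric-reference approach described above, assuming the input is already a decoded character string (not raw UTF-8 bytes). The helper name escape_non_ascii is hypothetical and not from the thread. It escapes & and < first, so the references it generates are not themselves re-escaped.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption, not from the thread).
# Escapes markup-significant characters, then converts every character
# outside printable ASCII into a numeric character reference.
sub escape_non_ascii {
    my ($str) = @_;
    $str =~ s/&/&amp;/g;                               # must run first
    $str =~ s/</&lt;/g;
    $str =~ s/([^\x20-\x7e])/'&#' . ord($1) . ';'/ge;  # e.g. e-acute -> &#233;
    return $str;
}

print escape_non_ascii("M&M caf\x{e9}"), "\n";   # prints M&amp;M caf&#233;
```

Note that this also turns newlines and tabs into references (&#10;, &#9;), which is legal XML but may not be what you want for pretty-printed output.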

(jeffa) Re: An ampersand is not well-formed XML data?
by jeffa (Bishop) on Apr 30, 2001 at 20:59 UTC
    Perhaps change ampersands to their HTML entity equivalent (&amp;) in my character event handler?

    yup, that's how you do it, but don't forget about <, >, and "
    here is one way to handle the problem for all data:

    # global lookup hash
    my %ESCAPES = (
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
    );

    # the subroutine
    sub xml_encode {
        my ($str) = @_;
        $str =~ s/([&<>"])/$ESCAPES{$1}/ge;
        return $str;
    }

    # and invoke it like
    $data = xml_encode($data);
    But this is just one way

    Jeff

    R-R-R--R-R-R--R-R-R--R-R-R--R-R-R--
    L-L--L-L--L-L--L-L--L-L--L-L--L-L--
    
      Ampersand needs to be encoded everywhere. Quote needs to be encoded within a quoted argument. Less-than and greater-than need to be encoded outside a quoted argument. It's not an error to encode all four everywhere, but it's overkill.

      -- Randal L. Schwartz, Perl hacker
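The context-dependent rules above could be sketched as two separate helpers, one for character data and one for double-quoted attribute values. The names escape_text and escape_attr are hypothetical. As an assumption beyond the post, this sketch also escapes < inside attribute values, since the XML 1.0 grammar does not allow a literal < there.

```perl
use strict;
use warnings;

# Hypothetical helpers (names are assumptions, not from the thread).

# Character data between tags: escape & and <.
sub escape_text {
    my ($str) = @_;
    $str =~ s/&/&amp;/g;
    $str =~ s/</&lt;/g;
    return $str;
}

# Value placed inside a double-quoted attribute: additionally escape ".
sub escape_attr {
    my ($str) = @_;
    $str = escape_text($str);
    $str =~ s/"/&quot;/g;
    return $str;
}

printf qq{<writeup title="%s">%s</writeup>\n},
    escape_attr('M&M "McFlurry"'),
    escape_text('M&M McFlurry (thing)');
# prints <writeup title="M&amp;M &quot;McFlurry&quot;">M&amp;M McFlurry (thing)</writeup>
```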

        Actually greater-than does not need to be encoded at all. There is never any problem with it, as it only has a special meaning at the end of a tag, where regular character data cannot appear. <doc att=">">></doc> is a perfectly valid piece of XML.

        Michel V. Rodriguez, XML Hacker ;--)


      Okay, right on. The problem now is where I should do the encoding. XML::Parser bombs out and dies as soon as it sees the ampersand, before it gets passed to the handler.

      I want to be able to either scan the XML from a file or get it from a socket. Am I going to have to read the data from one of those two places first, do the encoding, then have XML::Parser parse the results? That seems hard, because I'd have to decide before parsing what should be parsed (I don't want to go replacing the quotes around XML attributes with &quot; - the XML parser wouldn't be able to parse).

      Is there some easier way to do the encoding? Is there any way at all I can keep XML::Parser from crapping out before I get a chance to replace the ampersand?

      Thanks...
      ---
      donfreenut
        Okay, right on. The problem now is where I should do the encoding. XML::Parser bombs out and dies as soon as it sees the ampersand, before it gets passed to the handler.
        It needs to get done before it ends up as so-called XML. It's not XML if the encoding hasn't been done. Go upstream and fix the problem there. If you are getting files in that format, scream at the provider. For them to call it XML is doing a disservice to the meaning of what XML's about.

        -- Randal L. Schwartz, Perl hacker
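If the producer really cannot be fixed, one stopgap is to repair the text before the parser sees it. The sketch below is only that: a hypothetical helper (fix_bare_ampersands is not from the thread) that escapes any & which does not already begin an entity or character reference. It works on the raw string, so it cannot tell markup from data: it will leave something like "AT&T;" alone, and it will also rewrite ampersands inside CDATA sections.

```perl
use strict;
use warnings;

# Hypothetical stopgap (name is an assumption). Escapes every & that is
# not followed by "name;", "#digits;" or "#xHEX;".
sub fix_bare_ampersands {
    my ($xml) = @_;
    $xml =~ s/&(?!(?:[A-Za-z_:][\w.:-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/&amp;/g;
    return $xml;
}

my $broken = '<writeup node_id="980117">M&M McFlurry &amp; fries</writeup>';
print fix_bare_ampersands($broken), "\n";
# prints <writeup node_id="980117">M&amp;M McFlurry &amp; fries</writeup>
```

Since the attribute quotes are left untouched, the repaired string can then be handed to XML::Parser's parse method, which accepts a string as well as a filehandle. That covers both the file and the socket case, provided the whole document is read into memory first.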

Re: An ampersand is not well-formed XML data?
by traveler (Parson) on Apr 30, 2001 at 21:05 UTC
    Quoting XML 1.0, Appendix D: <quot> An ampersand (&) may be escaped numerically (&#38;) or with a general entity (&amp;). </quot> Thus either method should work.
Re: An ampersand is not well-formed XML data?
by little (Curate) on Apr 30, 2001 at 21:43 UTC
    In between XML tags you must enclose entities in CDATA sections, e.g.:
    <whatever> <![CDATA[brown &amp; schwiggs]]> </whatever>
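A minimal sketch of wrapping arbitrary text in a CDATA section, with a hypothetical helper name (cdata_wrap). Inside a CDATA section, references are not expanded, so a raw & is written as-is. The only sequence that needs care is "]]>", which would end the section early, so the sketch splits it across two sections.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption, not from the thread).
sub cdata_wrap {
    my ($str) = @_;
    # Split any "]]>" in the data so it cannot terminate the section.
    $str =~ s/\]\]>/]]]]><![CDATA[>/g;
    return '<![CDATA[' . $str . ']]>';
}

print '<whatever>', cdata_wrap('brown & schwiggs'), "</whatever>\n";
# prints <whatever><![CDATA[brown & schwiggs]]></whatever>
```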


    Have a nice day
    All decision is left to your taste
      You only need to enclose undeclared entities in CDATA sections. If you declare your entities in a DTD (internal or external), you can use them normally.
      <?xml version="1.0"?>
      <!DOCTYPE sanity [
        <!ENTITY check "&#666;">
      ]>
      <sanity> &check; </sanity>
      I usually include the three XHTML character entity sets in my DTD. This way I don't get errors if somebody slips in a £ or a ¥.
      <!-- SPECIAL CHARACTER ENTITY SETS DECLARED AND REFERENCED HERE -->
      <!ENTITY % xhtml-lat1 PUBLIC "-//W3C//ENTITIES Latin 1 for XHTML//EN" "xhtml-lat1.ent">
      %xhtml-lat1;
      <!ENTITY % xhtml-special PUBLIC "-//W3C//ENTITIES Special for XHTML//EN" "xhtml-special.ent">
      %xhtml-special;
      <!ENTITY % xhtml-symbol PUBLIC "-//W3C//ENTITIES Symbols for XHTML//EN" "xhtml-symbol.ent">
      %xhtml-symbol;
      Update: Here are the links to those character entity sets:

      Latin 1
      Special
      Symbols

      Get Strong Together!!