I'm parsing XML with the XML::Parser module.

The thing works fine, except when it encounters an ampersand in non-markup data (anything that's not a tag):

<writeup node_id="980117" reputation="0" createtime="2001-03-12 17:27:54">M&M McFlurry (thing)</writeup>

According to XML::Parser, this is not well-formed XML data, because of the & in M&M.

Is there any way I can get around this? Perhaps change ampersands to their HTML entity equivalent (&amp;) in my character event handler?

Here's my code:

sub parseUserSearchXML {
    my $XMLParser = new XML::Parser(Handlers => {Start => \&startHandler,
                                                 End   => \&endHandler,
                                                 Char  => \&charHandler});
    my $node;
    $XMLParser->parsefile($filename);
}

# event handler for XML::Parser - start tag event
sub startHandler {
    my ($expat, $tag, %attributes) = @_;
    $buffer = '';
    unless($tag =~ /$tags_to_ignore/o) {
        %temp = %attributes;
    }
}

# event handler for XML::Parser - non-markup event
sub charHandler {
    my ($expat, $string) = @_;
    $buffer .= $string;
}

# event handler for XML::Parser - end tag event
sub endHandler {
    my ($expat, $tag) = @_;
    unless($tag =~ /$tags_to_ignore/o) {
        $buffer =~ s/ \($crap_to_remove\)$//o;  # strip (person) (place) (thing) or (idea)
        $nodes{$buffer} = {%temp};
    }
    $buffer = '';
}
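
For what it's worth, here is a rough sketch of the "change ampersands to &amp;" idea from my question. As far as I can tell it has to happen before parsing, not in the character handler, because the parser dies on the bare & before charHandler ever sees that text. The helper name escape_bare_ampersands is made up for illustration and is not part of XML::Parser. The regex assumes that any & not followed by something shaped like an entity or character reference is a stray one, so it will leave text such as "AT&T;" alone and it does not know about CDATA sections.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::Parser): escape every '&' that does
# not already begin a named entity (&amp;), a decimal character reference
# (&#38;) or a hexadecimal one (&#x26;).
sub escape_bare_ampersands {
    my ($xml) = @_;
    $xml =~ s/&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/&amp;/g;
    return $xml;
}

# Shortened version of the sample element from the question.
my $raw   = '<writeup node_id="980117">M&M McFlurry (thing)</writeup>';
my $fixed = escape_bare_ampersands($raw);

print "$fixed\n";
# prints: <writeup node_id="980117">M&amp;M McFlurry (thing)</writeup>
```

To use it, I would read the file into a string, run it through the helper, and call $XMLParser->parse($string) on the result instead of $XMLParser->parsefile($filename). This only papers over the problem on my end; the cleaner fix is for whatever generates the XML to escape its own ampersands.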

---
donfreenut
