bear0053 has asked for the wisdom of the Perl Monks concerning the following question:

I am currently using XML::Simple to parse XML data and have run into the following problem:

When I send the XML to the parser it works fine as long as there are no special characters in the XML data.
Sample XML being parsed: <?xml version="1.0" encoding="UTF-8"?> <TRANSACTION> <FIELDS> <FIELD KEY="user">name</FIELD> <FIELD KEY="password">pass&word</FIELD> <FIELD KEY="operation_type">do_what</FIELD> </FIELDS> </TRANSACTION>
I call XMLin on the XML and it generates this error: not well-formed at line 1, column 124, byte 124 at Parsing Script line 168

This corresponds to the unescaped & in the password field.
Right now I am going to escape all the reserved characters in the XML string, send it to XMLin, then loop through the resulting keys and unescape the escaped characters to avoid this problem. What I am looking for, though, is a cleaner alternative. Thanks in advance.

Replies are listed 'Best First'.
Re: parsing reserved chars with xml::simple
by Vautrin (Hermit) on Feb 17, 2004 at 16:14 UTC
    Well, & is a reserved character. You need to escape it as an entity, i.e. &amp;. If you do not like escaping entities, you can use the following construct:
    <![CDATA[DATA GOES HERE]]>
    Inside a CDATA section you can put whatever you want between the brackets, with the exception of the end marker, ]]>. Follow this link to safari for a bigger example.
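Both options can be sketched on the sending side in a few lines of core Perl. This is only an illustration under stated assumptions: escape_xml_text and wrap_cdata are hypothetical helper names, not part of XML::Simple or any other module, and the sample value is the password from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: replace the five XML-reserved characters in a text
# value with their predefined entities.
sub escape_xml_text {
    my ($text) = @_;
    my %entity = (
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
        "'" => '&apos;',
    );
    $text =~ s/([&<>"'])/$entity{$1}/g;
    return $text;
}

# Hypothetical helper: wrap a text value in a CDATA section. A literal "]]>"
# in the value would end the section early, so it is split across two sections.
sub wrap_cdata {
    my ($text) = @_;
    $text =~ s/\]\]>/]]]]><![CDATA[>/g;
    return "<![CDATA[$text]]>";
}

my $password = 'pass&word';
print '<FIELD KEY="password">', escape_xml_text($password), "</FIELD>\n";
print '<FIELD KEY="password">', wrap_cdata($password), "</FIELD>\n";
```

Either form should come back from the parser as the original pass&word, since both entities and CDATA sections are decoded during parsing.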

    Want to support the EFF and FSF by buying cool stuff? Click here.
      You should also test XML against a DTD or XML Schema. This will immediately validate the XML content before it is sucked in by your app. Working in this way ensures that you spend your app programming time solving the problem rather than error handling badly formed XML.

      See this article introducing XML Schema.

Re: parsing reserved chars with xml::simple
by merlyn (Sage) on Feb 17, 2004 at 16:15 UTC
    Why would you want an XML parser to parse stuff that isn't XML?

    Down that path lies the madness of the variously differing error-correction schemes used by the "HTML" "parsers". And rightfully so, the XML creators said "no way" to that.

    If it's an XML parser, it's supposed to throw a fit when you feed it something other than XML. You should be feeding it XML, not non-XML.

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

Re: parsing reserved chars with xml::simple
by snowcrash (Friar) on Feb 17, 2004 at 16:14 UTC
    Well, your input is not well-formed XML, because it contains an unescaped ampersand. According to the XML spec it has to be written as the string "&amp;". I think the cleanest way is to create well-formed XML input in the first place, but you could also write some kind of preprocessor that converts the not-quite-XML to XML.
Re: parsing reserved chars with xml::simple
by csuhockey3 (Curate) on Feb 17, 2004 at 17:07 UTC
    Doesn't really apply to your question, but I got some great feedback on XML::Simple a while back that might be helpful -- here is that thread.

    -CSUhockey3
Re: parsing reserved chars with xml::simple
by bear0053 (Hermit) on Feb 17, 2004 at 18:52 UTC
    thanks for all your help.

    I agree... in an ideal situation &, ', ", <, > etc. should not be in the incoming XML. However, I have run into some situations where this "bad" data has to be sent in and needs to be preserved after the incoming XML is parsed. But I cannot parse it with the reserved characters in it. I am having trouble running through the XML and converting the reserved characters (< to !40 etc.), because a plain substitution converts every < to !40, including the ones that open tags. I need to break my XML apart so that I only call the convertchar function on the data elements of the XML and not on the tags themselves.
    $incoming_xml = ' <?xml version="1.0" encoding="UTF-8"?> <TRANSACTION> <FIELDS> <FIELD KEY="user">name</FIELD> <FIELD KEY="password">pass&word</FIELD> <FIELD KEY="operation_type"><do_what></FIELD> </FIELDS> </TRANSACTION>';
    Idea: split the incoming XML at the >DATA TO BE CONVERTED or WHITESPACE< boundaries and call the convertchar method only on the data between the > and <.
    I will need to be sure that I keep the XML syntax exactly how it should be. So, based on my example $incoming_xml, once the regex is run the converted XML should look like:
    $converted_xml = ' <?xml version="1.0" encoding="UTF-8"?> <TRANSACTION> <FIELDS> <FIELD KEY="user">name</FIELD> <FIELD KEY="password">pass!38word</FIELD> <FIELD KEY="operation_type">!40do_what!41</FIELD> </FIELDS> </TRANSACTION>';
    Once converted, I will send it through XMLin; then I can loop through the resulting hash and convert the !# codes back to their original values, thereby preserving the necessary bad values.

    I am having trouble creating a regex to do this, though. Once again, I appreciate all your help.

      s/&(?!amp|quot|apos|lt|gt)/&amp;/g; may work for you in some simple cases. This replaces ampersands that are not immediately followed by something that looks like one of the predefined entity names with an escaped version.
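A slightly stricter variant of that substitution can be sketched as follows. It is an assumption-laden illustration, not a general fix: escape_bare_ampersands is a hypothetical helper name, and the pattern only treats an ampersand as already escaped when it starts a complete reference, meaning a predefined entity name or a numeric character reference followed by a semicolon.

```perl
use strict;
use warnings;

# Hypothetical helper: escape only the ampersands that do not already begin a
# complete entity or character reference (name or number, then a semicolon).
sub escape_bare_ampersands {
    my ($text) = @_;
    $text =~ s/&(?!(?:amp|quot|apos|lt|gt|#[0-9]+|#x[0-9A-Fa-f]+);)/&amp;/g;
    return $text;
}

print escape_bare_ampersands('pass&word'), "\n";      # bare ampersand is escaped
print escape_bare_ampersands('pass&amp;word'), "\n";  # already escaped, left alone
print escape_bare_ampersands('pass&ampword'), "\n";   # no semicolon, treated as bare
```

Requiring the semicolon means a value such as pass&ampword is escaped rather than skipped, which the shorter lookahead would miss.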

        That works for & but it won't work for <, >, ', ".
        The regexes are:
        $text =~ s/&(?!amp|quot|apos|lt|gt)/!38/g;
        $text =~ s/"(?!amp|quot|apos|lt|gt)/!34/g;
        $text =~ s/<(?!amp|quot|apos|lt|gt)/!40/g;
        $text =~ s/>(?!amp|quot|apos|lt|gt)/!41/g;
        $text =~ s/'(?!amp|quot|apos|lt|gt)/!39/g;
        When I run that on my XML I get:
        !40?xml version=!341.0!34 encoding=!34UTF-8!34?!41 !40TRANSACTION!41 !40FIELDS!41 !40FIELD KEY=!34user!34!41name!40/FIELD!41 !40FIELD KEY=!34password!34!41pass!38word!40/FIELD!41 !40FIELD KEY=!34operation_type!34!41!40do_what!41!40/FIELD!41 !40/FIELDS!41 !40/TRANSACTION!41
        So I am still in the same spot: I need to be able to manipulate only the data in between > and <.

        thanks again for your help
      What do you mean by "where this 'bad' data has to be sent in and needs to be preserved from the incoming xml after it is parsed"? The reserved characters should always be escaped in the XML, or it isn't well-formed XML. If the source of the XML is not escaping the reserved characters, that is a bug in that program and should be fixed.

      Most parsers, including XML::Simple, will decode the entities into the characters. Most generators will do the reverse transformation when writing XML out. Most parsers have options to control whether they parse the entities or not.

Re: parsing reserved chars with xml::simple
by iburrell (Chaplain) on Feb 17, 2004 at 21:07 UTC
    XML::Simple will change all the entities for special characters into the characters when it parses text and attribute values.

    You can't easily escape the reserved characters in the not-quite XML without a heuristic parser. For example, &amp; is the entity for encoding &. You wouldn't want to turn that into &amp;amp;. Also, < and > are reserved characters with &lt; and &gt; for entities. How would you distinguish between bad characters in text and the real tags?

    The only place that knows what is text and what is elements is the source of the XML. You need to fix the source of the XML that is not encoding special characters to entities in text and attribute values.

Re: parsing reserved chars with xml::simple
by Anonymous Monk on Feb 19, 2004 at 16:57 UTC
    Your problem, as several people have pointed out, is that the thing that's handing you data isn't handing you XML. Instead, it's almost XML. Well, if I pass my machine binary code that is almost a valid program, but not quite, I get a core dump (as I should). If someone is contractually required to be handing you XML, you should bring this up with your manager. For the record, what the people on the other end should be sending you is:
    <?xml version="1.0" encoding="UTF-8"?> <TRANSACTION> <FIELDS> <FIELD KEY="user">name</FIELD> <FIELD KEY="password">pass&amp;word</FIELD> <FIELD KEY="operation_type">do_what</FIELD> </FIELDS> </TRANSACTION>
    Instead of &amp;, they could also say &#38; or &#x26;. The point is, their process isn't generating XML. For another example, consider that print XMLout({password => ['pass&word']}); produces:
    <opt> <password>pass&amp;word</password> </opt>
    Assuming that the other end doesn't fix this, there are at least two ways to try to fix this in your code. One, as another poster alluded to, is to wrap the offending section in CDATA. If you happen to know that the process generating this almost-XML-but-not-quite always puts its garbage inside FIELD tags, and that the open and close FIELD is always on the same line, you could do this before sending your output to the xml processor:
    $xmltext =~ s{(<FIELD .*?>)(.*)(</FIELD)}{$1<![CDATA[$2]]>$3}g;
    Of course, this has the disadvantage that if your source ever starts to send XML documents that are in fact XML documents, with everything escaped as it should be, then your code will now read it incorrectly.
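One way to soften that disadvantage is to escape, rather than CDATA-wrap, only what is not already escaped. The following is a minimal sketch under the same assumptions as above: the bad data only ever appears inside FIELD elements, and it never contains a literal </FIELD>. The names escape_field_text and repair_fields are hypothetical helpers, not part of XML::Simple.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the text content of one FIELD element.
# Ampersands that already start a complete reference are left alone, so
# input that was escaped correctly passes through unchanged.
sub escape_field_text {
    my ($text) = @_;
    $text =~ s/&(?!(?:amp|quot|apos|lt|gt|#[0-9]+|#x[0-9A-Fa-f]+);)/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Hypothetical helper: apply the escaping only between <FIELD ...> and the
# nearest </FIELD>, leaving every other tag in the document untouched.
sub repair_fields {
    my ($xml) = @_;
    $xml =~ s{(<FIELD\b[^>]*>)(.*?)(</FIELD>)}{$1 . escape_field_text($2) . $3}gse;
    return $xml;
}

my $incoming = '<FIELDS> <FIELD KEY="password">pass&word</FIELD> '
             . '<FIELD KEY="operation_type"><do_what></FIELD> </FIELDS>';
print repair_fields($incoming), "\n";
```

This is still a heuristic. If the sender never escapes anything and a real value happens to contain the literal text &amp;, the sketch has no way to tell that it was not meant as an entity.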

    Since your input isn't XML, you could also try parsing it by something that's not an XML parser, as suggested earlier by the reference to the wonderful article The Wrong Parser for the Right Reasons. For example, stealing copiously from that article, and using the HTML::Stream class for output, here's something that turns that particular bad non-XML into proper XML. This code simply assumes that anything between <FIELD> and </FIELD> was meant to be plain text, and not pieces of tags, or whatever. I've taken the liberty of adding several messed-up password examples:

    #!perl
    use HTML::Parser;
    use HTML::Stream;

    # see HTML::Stream documentation for how to output to a scalar
    $output = new HTML::Stream \*STDOUT;

    # inside FIELDs, ignore stuff that looks like tags
    $insidefield = 0;

    my $p = HTML::Parser->new (
        xml_mode => 1,
        start_h => [sub {
            my ($tagname, $attr, $text) = @_;
            if ($insidefield) {$output->text($text);}
            else {$output->tag($tagname, %$attr);}
            if ($tagname eq 'FIELD') {$insidefield = 1;}
        }, "tagname, attr,text"],
        text_h => [sub {
            my ($text) = @_;
            $output->text($text);
        }, "dtext"],
        end_h => [sub {
            my ($tagname, $text) = @_;
            if ($tagname eq 'FIELD') {$insidefield = 0;}
            if ($insidefield) {$output->text($text);}
            else {$output->tag("_$tagname");}
        }, "tagname, text"],
    );

    $p->parse_file(\*DATA);

    __END__
    <TRANSACTION>
    <FIELDS>
    <FIELD KEY="user">name</FIELD>
    <FIELD KEY="passwordlt">pass<word</FIELD>
    <FIELD KEY="passwordamp">pass&word</FIELD>
    <FIELD KEY="passwordgt">pass>word</FIELD>
    <FIELD KEY="passwordalmost">pass&ampword</FIELD>
    <FIELD KEY="passwordampcorrect">pass&amp;word</FIELD>
    <FIELD KEY="passwordtag">pa<ssw>ord</FIELD>
    <FIELD KEY="operation_type">do_what</FIELD>
    </FIELDS>
    </TRANSACTION>
    This code produces:
    <TRANSACTION>
    <FIELDS>
    <FIELD KEY="user">name</FIELD>
    <FIELD KEY="passwordlt">pass&lt;word</FIELD>
    <FIELD KEY="passwordamp">pass&amp;word</FIELD>
    <FIELD KEY="passwordgt">pass&gt;word</FIELD>
    <FIELD KEY="passwordalmost">pass&amp;ampword</FIELD>
    <FIELD KEY="passwordampcorrect">pass&amp;word</FIELD>
    <FIELD KEY="passwordtag">pa&lt;ssw&gt;ord</FIELD>
    <FIELD KEY="operation_type">do_what</FIELD>
    </FIELDS>
    </TRANSACTION>
    Note that passwordampcorrect is passed through unchanged. This is because that text was already valid. (It had everything already properly escaped - so that if your data supplier fixes their stuff, you'll still be ok) If you want to massage that content too, just change the "dtext" to "text". Also, if you're not worried about downright pathological text such as the passwordtag example, the program above can be simplified substantially by removing the bits that change the $insidefield variable and replacing the if ($insidefield) bits with the else block.

    Note that this program still fails to make heads or tails of things if someone wants to make "</FIELD>" their password. When that happens, you need to really start beating on the source of your data to send you valid XML.