bear0053 has asked for the wisdom of the Perl Monks concerning the following question:

In an ideal situation &, ', ", <, > etc. should not be in the incoming XML. However, I have run into situations where this "bad" data has to be sent in and needs to be preserved after the incoming XML is parsed, and I cannot parse it with the reserved characters in it. I am having trouble running through the XML and converting the reserved characters (< to !40 etc.) because a plain substitution converts all the < characters to !40, including the ones that open tags. I need to break my XML apart so that I only call the convertchar function on the data elements of the XML and not on the tags themselves.
$incoming_xml = '
<?xml version="1.0" encoding="UTF-8"?>
<TRANSACTION>
<FIELDS>
<FIELD KEY="user">name</FIELD>
<FIELD KEY="password">pass&word</FIELD>
<FIELD KEY="operation_type"><do_what></FIELD>
</FIELDS>
</TRANSACTION>';
Idea: split the incoming XML at each >DATA TO BE CONVERTED or WHITESPACE< and call the convertchar method only on the data between the > and the <. I need to be sure that I keep the XML syntax exactly as it should be. So, based on my example $incoming_xml, once the regex is run the converted XML should look like:
$converted_xml = '
<?xml version="1.0" encoding="UTF-8"?>
<TRANSACTION>
<FIELDS>
<FIELD KEY="user">name</FIELD>
<FIELD KEY="password">pass!38word</FIELD>
<FIELD KEY="operation_type">!40do_what!41</FIELD>
</FIELDS>
</TRANSACTION>';
Once converted, I will send it through XMLin; then I can loop through the resulting hash and convert the !# codes back to their original values, thereby preserving the necessary bad values.
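The convert-back step described above might look roughly like the following. This is only a sketch: the names `restore_chars` and `restore_all` are made up for illustration, and `$parsed` is a hand-written stand-in for whatever structure XMLin actually returns for this document, not real XMLin output. Note that any data that already contains a literal `!38`, `!40` or `!41` would be converted too, so the placeholder scheme is only safe if those sequences cannot occur in real input.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Placeholder => original character, using the codes from the question.
my %decode = ( '!38' => '&', '!40' => '<', '!41' => '>' );

# Restore every known placeholder in one left-to-right pass.
sub restore_chars {
    my ($text) = @_;
    $text =~ s/(!(?:38|40|41))/$decode{$1}/g;
    return $text;
}

# Walk a nested structure of hashes, arrays and plain strings,
# restoring the placeholders in every string value in place.
sub restore_all {
    my ($node) = @_;
    if ( ref($node) eq 'HASH' ) {
        $_ = restore_all($_) for values %$node;
    }
    elsif ( ref($node) eq 'ARRAY' ) {
        $_ = restore_all($_) for @$node;
    }
    elsif ( defined $node ) {
        $node = restore_chars($node);
    }
    return $node;
}

# Hypothetical stand-in for the parsed hash (assumed shape).
my $parsed = {
    FIELDS => {
        FIELD => [
            { KEY => 'user',           content => 'name' },
            { KEY => 'password',       content => 'pass!38word' },
            { KEY => 'operation_type', content => '!40do_what!41' },
        ],
    },
};

restore_all($parsed);
print "$_->{KEY}: $_->{content}\n" for @{ $parsed->{FIELDS}{FIELD} };
```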

i am having trouble creating a regex to do this though. Once again i appreciate all your help.

Replies are listed 'Best First'.
Re: regex on XML
by mirod (Canon) on Feb 17, 2004 at 21:31 UTC

    I have been down that road a couple of times, and ended up with ugly regexps that tried to identify whether < or & were part of the markup or needed to be escaped. Basically a < that's not followed by a letter should probably be escaped, and a & that's not followed by (#\d+|#x[0-9a-fA-F]+|\w+); should be escaped. Be sure to trace what you replace so you can spot problems.
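A rough sketch of that heuristic, for illustration only. The function name `escape_stray_markup` is made up, and allowing `/`, `?` and `!` after a `<` (so that closing tags, the XML declaration and comments survive) is an assumption added on top of the "not followed by a letter" rule. It cannot catch something like `<do_what>`, which looks exactly like a tag.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Escape '&' and '<' characters that do not look like markup.
# Returns the fixed string and a log of what was replaced.
sub escape_stray_markup {
    my ($xml) = @_;
    my @log;

    # An '&' that does not start a character reference (&#38; or &#x26;)
    # or a named entity (&amp;) is treated as data.
    $xml =~ s{&(?!(?:#\d+|#x[0-9A-Fa-f]+|\w+);)}
             { push @log, '& at offset ' . $-[0]; '&amp;' }ge;

    # A '<' that is not followed by a letter, '/', '?' or '!' is
    # treated as data. Offsets here refer to the string after the
    # first pass, so they may be shifted by earlier replacements.
    $xml =~ s{<(?![A-Za-z/?!])}
             { push @log, '< at offset ' . $-[0]; '&lt;' }ge;

    return ( $xml, \@log );
}

my ( $fixed, $log ) = escape_stray_markup('<a>AT&T, 1 < 2 &amp; &#38; &#x26;</a>');
print "$fixed\n";
print "replaced $_\n" for @$log;
```

Keeping the log is the "trace what you replace" part: reviewing it is the only way to spot cases where the guess was wrong.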

    Down this path lies madness though. If the provider of the data claims it's XML, then you usually have a good leverage to force them to fix it at the source. That's the sanest way to go. A little work on their part (maybe you can help them) will save you and eventually them lots of headaches down the road.

    Just for fun, I have actually used another (wrong) option: provided the XML is close enough to SGML, and has a DTD (or you can write its DTD easily), you can try using sx (also called osx in some Linux distributions) to convert the SGML to XML. SGML is actually much more lax about what needs to be escaped: the parser will try to figure out whether a < or & is a separate token or part of the markup. But once again that's just a stopgap (and probably quite a hard one to set up). Try to get the "quasi-XML" to be XML, and spend your time doing useful things instead of fixing other people's mistakes.

Re: regex on XML
by diotalevi (Canon) on Feb 17, 2004 at 19:51 UTC
Re: regex on XML
by CountZero (Bishop) on Feb 17, 2004 at 20:30 UTC
    This is a major problem. Normally, I would suggest you use some form of XML-parser which would give you a nice separation between tags and "content" and then you run HTML::Entities on the content.

    But of course any decent XML-parser (such as XML::Parser) will choke on this "bad" XML (I wonder if technically it is even XML due to the missing encoding of 'forbidden' characters).

    Therefore I suggest that you try to capture these 'forbidden' characters before they enter your XML. Can't you run the encode-function of HTML::Entities on the incoming data, prior to it being XML-ized?

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: regex on XML
by injunjoel (Priest) on Feb 18, 2004 at 01:56 UTC
    Greetings all,
    Here is my general idea... not all that elegant I know but it does what the OP is asking... I think.
    #!/usr/bin/perl -w
    use strict;

    my $incoming_xml = '
    <?xml version="1.0" encoding="UTF-8"?>
    <TRANSACTION>
    <FIELDS>
    <FIELD KEY="user">name</FIELD>
    <FIELD KEY="password">pass&word</FIELD>
    <FIELD KEY="operation_type"><do_what></FIELD>
    </FIELDS>
    </TRANSACTION>';

    print $incoming_xml;
    print "\n==============\n";

    # Convert only the text between <FIELD ...> and </FIELD>. The non-greedy
    # .*? stops at the first closing tag, so several fields on one line are safe.
    $incoming_xml =~ s/(<FIELD[^>]*>)(.*?)(<\/FIELD>)/$1.html_transliterate($2).$3/eg;
    print $incoming_xml;
    exit;

    # Replace the reserved characters with the placeholder codes from the question.
    sub html_transliterate {
        my $in_str = shift;
        $in_str =~ s/&/!38/g;
        $in_str =~ s/</!40/g;
        $in_str =~ s/>/!41/g;
        return $in_str;
    }
    Output
    <?xml version="1.0" encoding="UTF-8"?>
    <TRANSACTION>
    <FIELDS>
    <FIELD KEY="user">name</FIELD>
    <FIELD KEY="password">pass&word</FIELD>
    <FIELD KEY="operation_type"><do_what></FIELD>
    </FIELDS>
    </TRANSACTION>
    ==============
    <?xml version="1.0" encoding="UTF-8"?>
    <TRANSACTION>
    <FIELDS>
    <FIELD KEY="user">name</FIELD>
    <FIELD KEY="password">pass!38word</FIELD>
    <FIELD KEY="operation_type">!40do_what!41</FIELD>
    </FIELDS>
    </TRANSACTION>
    of course this is only working for XML nodes <FIELD> right now but can easily be modified. HTH.
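One possible way to modify it for more than one element, as a sketch only: pass in the names of the elements that hold plain data and build the alternation from that list. `convert_leaf_elements` is a made-up name, and the approach assumes the listed elements never contain real child tags, never nest inside themselves, and are never written in the self-closing `<FIELD/>` form.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reserved character => placeholder code, as in the script above.
my %encode = ( '&' => '!38', '<' => '!40', '>' => '!41' );

sub html_transliterate {
    my ($text) = @_;
    $text =~ s/([&<>])/$encode{$1}/g;
    return $text;
}

# Convert the content of every element whose name is in @names.
# Assumption: these are leaf elements holding text only.
sub convert_leaf_elements {
    my ( $xml, @names ) = @_;
    my $alt = join '|', map { quotemeta($_) } @names;

    # $1 = opening tag, $2 = element name, $3 = content, $4 = closing tag.
    # /s lets the content span several lines.
    $xml =~ s{(<($alt)\b[^>]*>)(.*?)(</\2\s*>)}
             { $1 . html_transliterate($3) . $4 }gse;
    return $xml;
}

my $xml = '<F><A k="1">a&b</A><B><x></B><C>1<2</C></F>';
print convert_leaf_elements( $xml, 'A', 'B' ), "\n";
```

Elements that are not listed (here `C`) are left alone, so the list has to cover every element that can carry bad data.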
      thanks that is exactly what i was looking for. as usual perlmonks come through again.
Re: regex on XML
by CountZero (Bishop) on Feb 17, 2004 at 20:55 UTC
    I gave it another thought and went to read some books on XML.

    Valid XML only needs & and < to be turned into &amp; and &lt; respectively. You don't have to encode ' " or > at all.

    Replacing all & by &amp; is trivial. This still leaves you with the troublesome <, but I don't think there is any possibility (other than validating your XML against a DTD) to let the script differentiate between <FIELDS> and <do_what>: they both look like a "valid" XML tag.

    Oh yes, just remembered: ]]> must be encoded as ]]&gt; if you want to use it as text.
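As a minimal sketch of those three rules applied to a piece of text before it is placed between tags (`xml_escape_text` is a made-up name, and this only helps if you can get at the data before it is wrapped in markup):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Escape a string for use as XML character data.
sub xml_escape_text {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;        # must run first, or it would re-escape &lt;
    $text =~ s/</&lt;/g;
    $text =~ s/\]\]>/]]&gt;/g;   # the only place where > has to be escaped
    return $text;
}

for my $value ( 'name', 'pass&word', '<do_what>', 'a ]]> b' ) {
    printf "<FIELD>%s</FIELD>\n", xml_escape_text($value);
}
```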

    CountZero


      That's not true. You do have to encode ' and " if you expect to use them inside a quoted string using that character as a delimiter. So <foo bar=" &quot; "/> and <foo bar=' &apos; '/>
        Perhaps, but it has nothing to do with it being valid XML or not: it is just about having balanced quotes.

        CountZero
