I have been down that road a couple of times, and ended up with ugly regexps that tried to identify whether a < or & was part of the markup or needed to be escaped. Basically a < that's not followed by a letter (or by a / for an end tag) should probably be escaped, and a & that's not followed by (#\d+|#x[0-9a-fA-F]+|\w+); should be escaped. Be sure to keep a trace of what you replace so you can spot problems.
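
For what it's worth, here is a rough sketch of that heuristic in Python, not a real fix. The names escape_stray, STRAY_LT and STRAY_AMP are made up for the example. Treating <!, <? and </ as markup is my own assumption on top of the "followed by a letter" rule. It will still guess wrong on things like "a <b" in running text, which is exactly why it logs every replacement.

```python
import re

# Assumption: a '<' counts as markup only when followed by a letter, '/',
# '!' or '?' (start tag, end tag, comment/CDATA/DOCTYPE, processing instruction).
STRAY_LT = re.compile(r"<(?![A-Za-z/!?])")

# An '&' counts as a reference only when followed by '#digits;', '#xhex;'
# or 'name;'. Anything else is treated as a stray ampersand.
STRAY_AMP = re.compile(r"&(?!(?:#[0-9]+|#x[0-9A-Fa-f]+|[A-Za-z_][\w.-]*);)")


def escape_stray(text):
    """Escape stray '<' and '&'; return (fixed_text, log).

    log holds one (offset, char, context) tuple per replacement. Offsets for
    '<' refer to the text after the '&' pass has already run.
    """
    log = []

    def make_repl(replacement):
        def repl(match):
            start = match.start()
            context = match.string[max(0, start - 10):start + 10]
            log.append((start, match.group(0), context))
            return replacement
        return repl

    text = STRAY_AMP.sub(make_repl("&amp;"), text)
    text = STRAY_LT.sub(make_repl("&lt;"), text)
    return text, log


if __name__ == "__main__":
    fixed, log = escape_stray("<a>1 < 2 & 3 &amp; &#x41; &foo</a>")
    print(fixed)
    for entry in log:
        print(entry)
```

Read the log before trusting the output: every entry is a place where the script guessed, and a human should check the guess.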

Down this path lies madness though. If the provider of the data claims it's XML, then you usually have good leverage to force them to fix it at the source. That's the sanest way to go. A little work on their part (maybe you can help them) will save you, and eventually them, lots of headaches down the road.

Just for fun, I have actually used another (wrong) option: provided the XML is close enough to SGML, and has a DTD (or you can write its DTD easily), you can try using sx (called osx in some Linux distributions) to convert the SGML to XML. SGML is actually much more lax about what needs to be escaped: the parser will try to figure out whether a < or & is a separate token or part of the markup. But once again that's just a stopgap (and probably quite a hard one to set up). Try to get the "quasi-XML" to be XML, and spend your time doing useful things instead of fixing other people's mistakes.


In reply to Re: regex on XML by mirod
in thread regex on XML by bear0053
