This is not an easy problem. It's quite easy to get a partial solution, and really hard to get a perfect one. The good news is that as long as you are conservative in what you fix, the XML parser will tell you about what you missed and no one will be hurt in the process ;--)

Also I would assume that what you get is not pathological, designed to trip the parser, but more like "XML by dummies", written by people who don't know the spec, or what a parser is. So probably no CDATA sections, no comments, no '>' in attribute values.

My first attempt would look like this:

If we find 2 successive '>' without a '<' in between, then the second '>' should be turned into an entity (the first one closes a tag, but not the second one). Same with 2 successive '<' without a '>' in between: the first '<' is not part of the markup (the second one opens a tag, but not the first one). For '&', if it doesn't look like an entity (&name; or &#...), then turn it into &amp;

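#!/usr/bin/perl
use strict;
use warnings;

while( <DATA>) {
    # 2 successive '>' with no '<' in between: escape the second one
    s{>([^<]*)>}{>$1&gt;}g;
    # 2 successive '<' with no '>' in between: escape the first one
    # (the lookahead leaves the second '<' available for the next match)
    s{<([^<>]*)(?=<)}{&lt;$1}g;
    # '&' that does not start something that looks like an entity
    s{&(?!\w+;|#)}{&amp;}g;
    print;
}
__DATA__
<doc><data>></data><data>if( 1 < 2 && 2 < 3)</data></doc>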

This doesn't catch the case of a '<' / '>' pair that's not part of a tag, as in 'if( $a<$b || $a > $c)'. You can improve this by first catching the '<'s separately. They are easier than the '>'s: if a '<' is not followed by /?\w+, then it can't be markup (once again this is a simplification, as \w also matches digits and the first character of a tag name can't be a digit).
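Here is a rough sketch of that improvement, not a definitive fix. The sub name fix_angle_brackets is made up for the example, and leaving '<?' and '<!' alone (so that an XML declaration or a DOCTYPE survives) is my own assumption. A '<' directly followed by a letter, as in '$a<b', will still be taken for a tag.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# hypothetical helper: escape stray '<' first, then apply the
# "2 successive '>'" rule to what is left
sub fix_angle_brackets {
    my( $xml)= @_;

    # a '<' that is not followed by an optional '/' and a letter or '_'
    # cannot open a tag; '<?' and '<!' are left alone (assumption)
    $xml=~ s{<(?!/?[A-Za-z_]|[?!])}{&lt;}g;

    # a '>' that follows another '>' with no '<' in between cannot close
    # a tag; loop so that runs of 3 or more are all fixed
    1 while $xml=~ s{>([^<>]*)>}{>$1&gt;};

    return $xml;
}

print fix_angle_brackets( '<data>if( $a<$b || $a > $c)</data>'), "\n";
# prints: <data>if( $a&lt;$b || $a &gt; $c)</data>
```

Once the stray '<' is gone, the '>' in the expression is the second '>' after the one closing <data>, so the original rule catches it.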

Some constructs also look like entities but are not, like '&#foo', so you could improve the regexp there too. But we are getting to the limits of what's reasonable here.
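A stricter entity test could look like the sketch below. The sub name fix_ampersands is invented for the example, and the name pattern is a simplification of what the XML spec allows. It leaves an '&' alone only when it starts &name;, &#digits; or &#xhex;, so '&#foo' gets escaped.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# hypothetical helper: escape every '&' that does not start a
# well-formed looking reference
sub fix_ampersands {
    my( $text)= @_;
    $text=~ s{&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)}{&amp;}g;
    return $text;
}

print fix_ampersands( 'if( 1 < 2 && 2 < 3) &#foo &#38; &lt;'), "\n";
# prints: if( 1 < 2 &amp;&amp; 2 < 3) &amp;#foo &#38; &lt;
```

Something like &foo; still passes this test even if the entity is never declared, but in that case the parser will complain, which is the conservative outcome.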

It all depends on what you want: to limit the number of cases where you have to fix the data manually, or to never encounter a well-formedness error.

Edited: improved explanations (hopefully!)


In reply to Re: Regular expression to replace xml data by mirod
in thread Regular expression to replace xml data by dalegribble
