
I did understand that the OP is trying to clean up XML-like stuff: an unescaped less-than sign in the data portion of a tag is not well-formed XML (a bare greater-than sign is technically permitted except in "]]>", but is usually escaped as well), so by definition this parsing exercise is a clean-up exercise.

However, that doesn't change my observation that handling this with regexen will (a) apply only to special cases, (b) require complex regexen, and (c) require going beyond the regex paradigm. For example, the following code will parse his XML sample correctly, but it only works if we can guarantee that same-named tags are never nested.

use strict;
use warnings;

my $str = '<Data1>Data</Data1><Data2></Data2><Data3> < </Data3>';

# this code only works if same name tags are never nested
# in your XML-like samples.

$str =~ s/^\s+//;

my $sResult='';
while ($str =~ m{<(\w+)>((?:[^<]|<(?!/\1>))*)</\1>\s*}g) {
  my $tag = $1;
  my $innards = $2;

  # /g so that every stray < and > is escaped, not just the first
  $innards =~ s/</&lt;/g;
  $innards =~ s/>/&gt;/g;
  $sResult .= "<$tag>$innards</$tag>";
}
print STDERR "output: $sResult\n";
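
To make the limitation concrete, here is a small sketch that wraps the same loop in a sub (clean_flat is a name I made up for illustration) and runs it on both the OP's sample and a sample with same-name nesting. On the nested input the inner tag is escaped as if it were data and the tail of the string is silently dropped.

```perl
use strict;
use warnings;

# Hypothetical helper: the same regex loop wrapped in a sub so it can be
# run on more than one input. Only valid when same-name tags never nest.
sub clean_flat {
    my ($str) = @_;
    my $result = '';
    while ($str =~ m{<(\w+)>((?:[^<]|<(?!/\1>))*)</\1>\s*}g) {
        my ($tag, $innards) = ($1, $2);
        $innards =~ s/</&lt;/g;
        $innards =~ s/>/&gt;/g;
        $result .= "<$tag>$innards</$tag>";
    }
    return $result;
}

# The OP's sample: the stray '<' inside Data3 is escaped.
print clean_flat('<Data1>Data</Data1><Data2></Data2><Data3> < </Data3>'), "\n";
# prints: <Data1>Data</Data1><Data2></Data2><Data3> &lt; </Data3>

# Same-name nesting: the inner <a> is treated as data and "z</a>" is lost.
print clean_flat('<a>x<a>y</a>z</a>'), "\n";
# prints: <a>x&lt;a&gt;y</a>
```

Note that nothing warns about the lost text, so a failure of this kind is easy to miss unless the output is compared against the input.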

Your own module (HTML::JFilter) handles the nested case with grace and it even uses only regular expressions, but you can hardly claim this is a simple set of regular expressions:

sub PolishHTML {
  my $str = shift;
  if ($AllowXHTML) {
    $str =~ s{(.*?)(&\w+;|&#\d+;|<\w[\w\d:\-]*(?:\s+\w[\w\d:\-]*(?:\s*=\s*(?:[^" '><\s]+|(?:'[^']*')+|(?:"[^"]*")+))?)*\s*/?>|</\w[\w\d:\-]*>|<!--.*?-->|$)}
                 {HTML::Entities::encode($1, '^\r\n\t !\#\$%\"\'-;=?-~').$2}gem;
  } else {
    $str =~ s{(.*?)(&\w+;|&#\d+;|<\w[\w\d:\-]*(?:\s+\w[\w\d:\-]*(?:\s*=\s*(?:[^" '><\s]+|(?:'[^']*')+|(?:"[^"]*")+))?)*\s*>|</\w[\w\d:\-]*>|<!--.*?-->|$)}
                 {HTML::Entities::encode($1, '^\r\n\t !\#\$%\"\'-;=?-~').$2}gem;
  }
  return $str;
}

Given the complexities of writing and maintaining this sort of code, relying on pre-built and pre-tested modules (such as you have suggested) is a very good idea. Even so, the modules need to be carefully evaluated to make sure they can handle the particular range of XML-like text one needs to process.
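
One way to do that evaluation is a small table of inputs and expected outputs, run against whatever cleaner is under consideration. The sketch below is only an illustration: clean_with_stack is a hypothetical stand-in I wrote for the candidate (it is not any real module's API), and failed_cases is likewise a made-up helper. The stand-in keeps a stack of open tags, which is one example of stepping outside a single regex. It assumes anything shaped like <word> is a real tag, so a literal "<b>" in the data would be misread; the table is where you record which of those choices you actually need.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the cleaner being evaluated.
# Tag-shaped tokens are tracked on a stack; everything else is data.
sub clean_with_stack {
    my ($str) = @_;
    my @open;
    my $out = '';
    for my $tok (split m{(</?\w+>)}, $str) {
        if ($tok =~ m{^<(\w+)>\z}) {
            push @open, $1;              # opening tag: remember it
            $out .= $tok;
        }
        elsif ($tok =~ m{^</(\w+)>\z} && @open && $open[-1] eq $1) {
            pop @open;                   # close matching the innermost open tag
            $out .= $tok;
        }
        else {
            (my $data = $tok) =~ s/</&lt;/g;   # data, or an unmatched close
            $data =~ s/>/&gt;/g;
            $out .= $data;
        }
    }
    return $out;
}

# Each case: [ description, input, expected output ].
my @cases = (
    [ 'stray < in data',   '<Data3> < </Data3>', '<Data3> &lt; </Data3>' ],
    [ 'stray > in data',   '<a>1 > 0</a>',       '<a>1 &gt; 0</a>' ],
    [ 'same-name nesting', '<a>x<a>y</a>z</a>',  '<a>x<a>y</a>z</a>' ],
    [ 'unmatched close',   '<a>1 </b> 2</a>',    '<a>1 &lt;/b&gt; 2</a>' ],
);

# Returns the descriptions of the cases the cleaner gets wrong.
sub failed_cases {
    my ($cleaner, @cases) = @_;
    return map { $_->[0] } grep { $cleaner->($_->[1]) ne $_->[2] } @cases;
}

my @failed = failed_cases(\&clean_with_stack, @cases);
print @failed ? "failed: @failed\n" : "all cases passed\n";
# prints: all cases passed
```

To evaluate a real module, pass a code reference that wraps its entry point in place of \&clean_with_stack and extend @cases with samples taken from your own data.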

Best, beth

Update: fixed a typo in my code: (?:[^>]|< should have been (?:[^<]|<

In reply to Re^2: Regular expression to replace xml data by ELISHEVA
in thread Regular expression to replace xml data by dalegribble
