I did understand that the OP is trying to clean up XML-like stuff: unescaped less-than signs in the data portion of arbitrary tags are not well-formed XML (bare greater-than signs are technically legal there, but are usually escaped as well), so by definition, this parsing exercise is a clean-up exercise.
However, it doesn't change my observation that handling this with regexen (a) will apply only to special cases, (b) will require complex regexen, and (c) will require going beyond the regex paradigm. For example, the following code will parse his XML sample correctly, but it only works if we can guarantee that same-named tags are never nested.
use strict;
use warnings;
my $str = '<Data1>Data</Data1><Data2></Data2><Data3> < </Data3>';
# this code only works if same name tags are never nested
# in your XML-like samples.
$str =~ s/^\s+//;
my $sResult='';
while ($str =~ m{<(\w+)>((?:[^<]|<(?!/\1>))*)</\1>\s*}g) {
    my $tag = $1;
    my $innards = $2;
    # escape every stray angle bracket left in the data portion
    $innards =~ s/</&lt;/g;
    $innards =~ s/>/&gt;/g;
    $sResult .= "<$tag>$innards</$tag>";
}
print STDERR "output: $sResult\n";
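To make the nesting limitation concrete, here is a minimal sketch that wraps the same loop in a sub and feeds it both a flat and a same-name-nested sample. The sub name escape_flat and the nested test string are my own inventions for illustration; they are not from the OP's data.

```perl
use strict;
use warnings;

# Hypothetical helper: same regex as above, wrapped in a sub so it can
# be run against several inputs. The name escape_flat is an assumption.
sub escape_flat {
    my ($str) = @_;
    my $result = '';
    while ($str =~ m{<(\w+)>((?:[^<]|<(?!/\1>))*)</\1>\s*}g) {
        my ($tag, $innards) = ($1, $2);
        # escape every angle bracket found in the data portion
        $innards =~ s/</&lt;/g;
        $innards =~ s/>/&gt;/g;
        $result .= "<$tag>$innards</$tag>";
    }
    return $result;
}

# flat input: a stray less-than sign inside a single tag
my $flat   = escape_flat('<Data3> < </Data3>');
# made-up nested input: a tag nested inside a tag of the same name
my $nested = escape_flat('<a><a> < </a></a>');

print "flat:   $flat\n";
print "nested: $nested\n";
```

On the flat sample the stray sign is escaped as intended. On the nested sample the match stops at the first closing tag, so the inner opening tag is treated as data and escaped, and the outer closing tag is silently dropped. That is exactly the special-case restriction I mean.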
Your own module (HTML::JFilter) handles the nested case with grace and it even uses only regular expressions, but you can hardly claim this is a simple set of regular expressions:
sub PolishHTML {
    my $str = shift;
    if ($AllowXHTML) {
        $str =~ s{(.*?)(&\w+;|&#\d+;|<\w[\w\d:\-]*(?:\s+\w[\w\d:\-]*(?:\s*=\s*(?:[^" '><\s]+|(?:'[^']*')+|(?:"[^"]*")+))?)*\s*/?>|</\w[\w\d:\-]*>|<!--.*?-->|$)}
                 {HTML::Entities::encode($1, '^\r\n\t !\#\$%\"\'-;=?-~').$2}gem;
    } else {
        $str =~ s{(.*?)(&\w+;|&#\d+;|<\w[\w\d:\-]*(?:\s+\w[\w\d:\-]*(?:\s*=\s*(?:[^" '><\s]+|(?:'[^']*')+|(?:"[^"]*")+))?)*\s*>|</\w[\w\d:\-]*>|<!--.*?-->|$)}
                 {HTML::Entities::encode($1, '^\r\n\t !\#\$%\"\'-;=?-~').$2}gem;
    }
    return $str;
}
Given the complexities of writing and maintaining this sort of code, relying on pre-built and pre-tested modules (such as you have suggested) is a very good idea. Even so, the modules need to be carefully evaluated to make sure they can handle the particular range of XML-like text one needs to process.
Best, beth
Update: fixed a typo in my code: (?:[^>]|< should have been (?:[^<]|<