Granting that someone is providing you with data where these special characters have not been "escaped" by using entity reference forms, the parser modules will probably have trouble with it. You'll most likely have to doctor the data with regexes first, then validate your fixed data using a parser module (or using James Clark's "nsgmls" utility, which comes with his "SP" package; his "expat" library is something you would need to install anyway in order to install the Perl XML parser modules).
The approach I would suggest (having faced a similar problem many times) is to diagnose the data thoroughly first: figure out the patterns that represent the full inventory of XML tags and entities that are being used properly, and then look for cases of the special characters that do not occur in these patterns. If the example you gave is typical, it could be as simple as finding all the cases of special characters that are bounded by whitespace on both sides.
My operating assumption would be that folks who create XML this way will tend to have a fairly simple tagging design, not using any really sophisticated syntax that would hose a regex solution. (Sure, it could also mean that they're really sloppy and/or stupid, and anything might happen...)
Still, a first pass scan that is likely to help you comprehend the situation might be something like:
#!/usr/bin/perl
use strict;
use warnings;

$/ = "</Root>\n";    # read one whole <Root> structure per record
while (<>) {
    my $chk = $_;                 # make a copy that we can muck with
    my @prob;
    $chk =~ s{</?\w+/?>}{}g;      # remove known "good tag" patterns (tags without attributes)
    push @prob, 'stray angle bracket(s)' if $chk =~ /[<>]/;
    $chk =~ s{&#?\w+;}{}g;        # remove known "good entities", including numeric ones like &#38;
    push @prob, 'stray ampersand(s)' if $chk =~ /&/;
    print "Record $. has ", join( ' and ', @prob ), ":\n$_" if @prob;
}
Another thing I have to do from time to time is a sanity check -- e.g. tag-like behavior usually involves a very small type/token ratio (the number of distinct tags is small, and their frequency of occurrence is relatively high), and of course in XML, a tag must either be self-closing (a slash just before the closing angle bracket), or else occur in equal numbers of open and close tags. So, count up the occurrences of each thing that looks like a tag, and see if there are any outliers -- this is easy with a unix command line:
# perl 1-liner to output one "tag" per line (dropping the text between tags):
perl -pe 's{^[^<]*}{}; s{>[^<]*}{>\n}g;' file.xml | sort | uniq -c
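If you'd rather have the open/close comparison done for you, here is a minimal sketch of the same sanity check in Perl. The sub names (tally_tags, unbalanced_tags) and the sample string are made up for illustration, and the tag regex is only a rough guess at what "looks like a tag", so adjust it to whatever your data really contains:

```perl
use strict;
use warnings;

# Tally everything that looks like a tag, keyed by tag name, split into
# open, close and empty (self-closing) forms.
sub tally_tags {
    my ($text) = @_;
    my %tally;
    while ( $text =~ m{<(/?)([^\s<>/]+)[^<>]*?(/?)>}g ) {
        my ( $close, $name, $empty ) = ( $1, $2, $3 );
        my $kind = $close ? 'close' : $empty ? 'empty' : 'open';
        $tally{$name}{$kind}++;
    }
    return \%tally;
}

# Return the tag names whose open and close counts differ.
sub unbalanced_tags {
    my ($tally) = @_;
    my @bad = grep {
        ( $tally->{$_}{open} // 0 ) != ( $tally->{$_}{close} // 0 )
    } keys %$tally;
    return sort @bad;
}

# Hypothetical sample: one stray "<" and one <item> that is never closed.
my $sample = "<Root><item>a < b</item><br/><item>x</Root>\n";
my $tally  = tally_tags($sample);
print "unbalanced: $_\n" for unbalanced_tags($tally);
```

On that sample, only "item" is reported: it is opened twice and closed once, while the self-closing "br" and the whitespace-bounded "<" are not counted as open tags at all.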
Spend some time reviewing the data this way to make sure your regexes can correctly identify all non-tag, non-entity uses of these characters, then adapt those regexes to do the necessary substitutions.
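For the simplest case mentioned earlier (special characters with whitespace on both sides), the substitution step might look something like this. It's only a sketch: the sub name escape_stray and the %ENTITY table are my own inventions, and it assumes that whitespace-bounded characters are the only strays in your data, which is exactly what the diagnosis pass is supposed to confirm:

```perl
use strict;
use warnings;

# Entity reference for each of the three special characters.
my %ENTITY = ( '<' => '&lt;', '>' => '&gt;', '&' => '&amp;' );

# Escape a special character only when it has whitespace (or the start
# or end of the string) on both sides; real tags and existing entity
# references are left alone.
sub escape_stray {
    my ($text) = @_;
    $text =~ s{(?<!\S)([<>&])(?!\S)}{$ENTITY{$1}}g;
    return $text;
}

print escape_stray("<item>AT & T, where 3 < 5 > 2 &amp; so on</item>\n");
```

The lookarounds keep the surrounding whitespace untouched, so the only change to each record is the escaped character itself; the existing "&amp;" and the "item" tags pass through unchanged.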
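Actually, I believe it's the case that when these special characters are delimited on both sides by whitespace, lenient parsers such as HTML browsers don't have a problem with them: note that these three -- < & > -- have all been typed as-is with spaces around them (not as entity references, and not inside "pre" or "code" tags). So maybe your data suppliers aren't really screwing up at all, at least if whatever consumes the data is that forgiving; a strict XML parser still requires a literal < or & in text to be escaped, while a bare > is allowed.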
But maybe you have some hyper-sensitive process that doesn't like this "liberal" usage, and I suppose it's not uncommon for people (and processes) to take a purist attitude -- once a special character, always a special character, and don't trust something as slippery as whitespace to tell you otherwise.
(update: fixed the grammar a bit, and tried to make the 1-liner easier to read)