skinnymofo has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,
  Does anyone have a regex or code that can escape the XML reserved characters &, <, >, ', and " embedded in the data portions of the XML?
  Unfortunately, I can't use the Perl XML modules b/c I am on an older UNIX (System V, Perl 5.005) system and I don't have the rights to install any modules.
  Basically, if I have some XML like this:

<Root>
  <Node1>Data with < in it</Node1>
  <Node2>Data with > in it</Node2>
</Root>

  I would like to turn the > and < into &gt; and &lt;, respectively.

Any help is much appreciated,
Skinny

Re: Escaping XML Reserved characters
by samtregar (Abbot) on Sep 14, 2003 at 07:15 UTC
    No problem:

    $data =~ s/&/&amp;/g;
    $data =~ s/</&lt;/g;
    $data =~ s/>/&gt;/g;
    print "<Node1>$data</Node1>\n";

    Of course, you should know that you don't need administrative rights to use Perl modules. Just put them in a directory of your own and push that library onto @INC. For example, if I installed XML::Writer as /home/sam/modules/XML/Writer.pm I'd say:

    BEGIN { push(@INC, '/home/sam/modules'); }
    use XML::Writer;

    Give it a try!

    -sam
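The same thing can be written with the `use lib` pragma. This is only a sketch, reusing the example path /home/sam/modules from the post above. One difference worth knowing: `use lib` puts the directory at the front of @INC rather than pushing it onto the end, so a private copy of a module is found before any system copy.

```perl
#!/usr/bin/perl
# Assumes the modules were unpacked under /home/sam/modules
# (the example path from the post above, not a real requirement).
use lib '/home/sam/modules';    # compile-time; adds the directory to the front of @INC

# Print the search path to confirm the private directory is in it.
print "$_\n" for @INC;
```

If XML::Writer lives at /home/sam/modules/XML/Writer.pm, a plain `use XML::Writer;` after the `use lib` line should then find it.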

Re: Escaping XML Reserved characters
by hawtin (Prior) on Sep 14, 2003 at 07:47 UTC

    As well as modifying @INC you can also install your own private modules and change the $PERL5LIB environment variable to point to the directory where you installed them.

    In fact you would probably gain more by learning to access privately installed CPAN modules within the restrictions placed by your systems rather than reinventing wheels IMHO.

Re: Escaping XML Reserved characters
by graff (Chancellor) on Sep 15, 2003 at 06:11 UTC
    Granting that someone is providing you with data where these special characters have not been "escaped" by using entity reference forms, it's likely that the parser modules will have some trouble with this, and you'll probably have to doctor the data with regexes first, then validate your fixed data using a parser module (or using James Clark's "nsgmls" utility, which comes with his "SP" package; his "expat" library is the one you would need to install anyway in order to install the perl XML parser modules...)

    The approach I would suggest (having faced a similar problem many times) is to diagnose the data thoroughly first: figure out the patterns that represent the full inventory of XML tags and entities that are being used properly, and then look for cases of the special characters that do not occur in these patterns. If the example you gave is typical, it could be as simple as finding all the cases of special characters that are bounded by whitespace on both sides.

    My operating assumption would be that folks who create XML this way will tend to have a fairly simple tagging design, not using any really sophisticated syntax that would hose a regex solution. (Sure, it could also mean that they're really sloppy and/or stupid, and anything might happen...)

    Still, a first pass scan that is likely to help you comprehend the situation might be something like:

    #!/usr/bin/perl
    $/ = "</Root>\n";                # let's read whole structures
    while (<>) {
        my $chk = $_;                # make a copy that we can muck with
        $chk =~ s{</?\w+/?>}{}g;     # remove known "good tags" patterns
        my $prob = ( $chk =~ /[<>]/ ) ? 'stray angle bracket(s)' : '';
        $chk =~ s{\&\w+\;}{}g;       # remove known "good entities"
        $prob .= ' stray ampersand(s)' if ( $chk =~ /\&/ );
        print "Record $. has $prob:\n$_" if $prob;
    }

    Another thing I have to do from time to time is a sanity check -- e.g. tag-like behavior usually involves a very small type/token ratio (the number of distinct tags is small, and their frequency of occurrence is relatively high), and of course in XML, tags must either have a slash at the end of the tag name, or else have equal numbers of open and close tags. So, count up the occurrences of each thing that looks like a tag, and see if there are any outliers -- this is easy with a unix command line:
    # perl 1-liner to output one "tag" per line:
    perl -pe 's{^.*?<}{}; s{>[^<]*}{>\n}g;' file.xml | sort | uniq -c
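
The open/close balance check can also be done inside Perl rather than with sort and uniq. What follows is a rough sketch, not a validator: it assumes simple tags with no attributes, as in the sample data, and the sub names count_tags and unbalanced_tags are made up for illustration.

```perl
#!/usr/bin/perl
# Count opening and closing tags per name and report any imbalance.
# Assumes simple tags with no attributes, as in the sample data.
sub count_tags {
    my ($xml) = @_;
    my (%open, %close);
    while ($xml =~ m{<(/?)(\w+)(/?)>}g) {
        my ($slash, $name, $empty) = ($1, $2, $3);
        next if $empty;                       # <tag/> needs no partner
        if ($slash) { $close{$name}++ } else { $open{$name}++ }
    }
    return (\%open, \%close);
}

# Return the sorted list of tag names whose open and close counts differ.
sub unbalanced_tags {
    my ($xml) = @_;
    my ($open, $close) = count_tags($xml);
    my %seen = map { $_ => 1 } keys %$open, keys %$close;
    return grep { ($open->{$_} || 0) != ($close->{$_} || 0) } sort keys %seen;
}

print join(", ", unbalanced_tags("<Root><Node1>x</Node1><Node2>y</Root>")), "\n";
```

Any name that shows up in the result is either a real tag that is missing its partner or a piece of data that merely looks like a tag, and both are worth a look.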

    Spend some time reviewing the data this way to make sure your regexes can correctly identify all non-tag, non-entity uses of these characters, then adapt those regexes to do the necessary substitutions.
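
As a sketch of what the adapted substitutions might look like in the simplest case (special characters bounded by whitespace on both sides, as in the original example), something like the following could serve as a starting point. The sub name escape_stray is made up, and the patterns assume that real tags never have whitespace directly inside the angle brackets.

```perl
#!/usr/bin/perl
# Escape only the "stray" special characters:
#   - an ampersand that does not start an entity reference
#   - an angle bracket with whitespace on both sides
# Assumption: real tags never look like " < " or " > ".
sub escape_stray {
    my ($text) = @_;
    $text =~ s/&(?!#?\w+;)/&amp;/g;      # leave &lt; &#38; etc. alone
    $text =~ s/(\s)<(?=\s)/$1&lt;/g;     # whitespace-bounded <
    $text =~ s/(\s)>(?=\s)/$1&gt;/g;     # whitespace-bounded >
    return $text;
}

print escape_stray("<Node1>Data with < in it</Node1>\n");
print escape_stray("<Node2>Data with > in it</Node2>\n");
```

Anything the diagnostic pass turns up that is not whitespace-bounded would need its own pattern.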

    Actually, I believe it's the case that when these special characters are delimited on both sides by whitespace, parsers don't have a problem with them: behold that these three -- < & > -- have all been typed as-is with spaces around them (not as entity references, and not inside "pre" or "code" tags). So maybe your data suppliers aren't really screwing up at all.

    But maybe you have some hyper-sensitive process that doesn't like this "liberal" usage, and I suppose it's not uncommon for people (and processes) to take a purist attitude -- once a special character, always a special character, and don't trust something as slippery as whitespace to tell you otherwise.

    (update: fixed the grammar a bit, and tried to make the 1-liner easier to read)

Re: Escaping XML Reserved characters
by stingray (Initiate) on Sep 15, 2003 at 07:07 UTC

    The following looks like it might do what you want. Note though that it will *not* work if you have character data and elements mixed together in the same element.

    #!/usr/bin/perl
    use strict;
    use HTML::Entities qw(encode_entities);
    $| = 1;
    
    my $str = "<Root>
      <Node1>Data with < in it</Node1>
      <Node2>Data with > in it</Node2>
      <Node3>
          <SubNode1>'one'</SubNode1>
          <SubNode2><\"two\"></SubNode2>
      </Node3>
    </Root>";
    
    Match($str);
    
    sub Match
    {
        my $str = shift;
        my $c = shift || 0;
        # Match an element: <tag>...</tag>, tag names made of [_a-zA-Z0-9]
        while ($str =~ m!<([_a-zA-Z0-9]+)>(.+?)</\1>!sg)
        {
            my $tag = $1;
            my $tmp = $2;
            printf("%s<$tag>\n", "\t" x $c);
            # If there are subelements, recurse, otherwise,
            # just encode and print the data.
            if ($tmp =~ m!<([_a-zA-Z0-9]+)>(.+?)</\1>!)
            {
                Match($tmp, $c + 1);
            }
            else
            {
                $tmp = encode_entities($tmp);
                # Pass the data as an argument so a % in it is not
                # taken as a printf format directive.
                printf("%s%s\n", "\t" x ($c + 1), $tmp);
            }
            printf("%s</$tag>\n", "\t" x $c);
        }
    }
    

    Outputs:

    <Root>
            <Node1>
                    Data with &lt; in it
            </Node1>
            <Node2>
                    Data with &gt; in it
            </Node2>
            <Node3>
                    <SubNode1>
                            'one'
                    </SubNode1>
                    <SubNode2>
                            &lt;&quot;two&quot;&gt;
                    </SubNode2>
            </Node3>
    </Root>
    

    (Edited for formatting)

    Re: Escaping XML Reserved characters
    by shenme (Priest) on Sep 14, 2003 at 13:49 UTC
      Wait, could you clarify please?   Are you saying you _already_ have XML files with text like you showed above?   Or are you generating the XML files?

      If you're generating the XML files then what samtregar gave you should be enough.   If you really want to escape ' " ' you can add &quot; to his code.
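
For the generating case, a minimal sketch that wraps samtregar's substitutions in a sub and adds the two quote characters might look like this. The name escape_xml is made up for illustration. The ampersand has to be handled first, otherwise the & in the entities produced by the later substitutions would be escaped again.

```perl
#!/usr/bin/perl
# Escape the five XML reserved characters in a piece of data
# that is about to be placed inside an element.
sub escape_xml {
    my ($data) = @_;
    $data =~ s/&/&amp;/g;     # must come first
    $data =~ s/</&lt;/g;
    $data =~ s/>/&gt;/g;
    $data =~ s/"/&quot;/g;
    $data =~ s/'/&apos;/g;
    return $data;
}

print "<Node1>", escape_xml(q{Tom & "Jerry" < 'mice'}), "</Node1>\n";
```

This only makes sense on raw data. Run on text that is already escaped, it would turn &lt; into &amp;lt;.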

        I already have the XML. That's why samtregar's suggestion isn't ideal.
        Also, as we've seen, escaping the & and the quotes isn't a big deal, it's the < and > that I'm having trouble with.
        A regex like s/>/&gt;/g; would replace every >, including the ones that close the tags, which isn't a good thing.

          You may have more of a dilemma than you realize here: You're saying (I think) that you have existing XML that's non-conforming because it hasn't already used entities for the 3 required characters in the data: <, >, and &. < and > are difficult because you (obviously) want to leave them in the tags. That would mean to get it right, you'd have to parse the XML. However most parsers will, I think, get confused by the unescaped < and > characters. i.e., most parsers expect conforming XML. At any rate, if you can't simply regenerate the XML and do the entity conversion, I think you have to start by trying to parse it. Something like XML::Parser would be where I'd start, but I just tried feeding it some XML without escaped < and > characters and it flagged those as errors. Maybe there are settings that let you get around that.
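
If a real parser won't accept the data, one regex-only way to leave the tags alone is to split the text on anything that looks like a tag and escape only the pieces in between. This is a sketch under strong assumptions: tags have no attributes, and the data never contains something that itself looks like a complete tag (for example a literal <b> in the text). The sub name escape_between_tags is made up.

```perl
#!/usr/bin/perl
# Escape &, < and > everywhere except inside things that look like tags.
# Assumption: tags are simple (<Name>, </Name>, <Name/>) with no attributes.
sub escape_between_tags {
    my ($xml) = @_;
    # The capturing parentheses make split return the tags as well,
    # so the list alternates: data, tag, data, tag, ...
    my @parts = split m{(</?[A-Za-z_][\w.-]*\s*/?>)}, $xml;
    for my $i (0 .. $#parts) {
        next if $i % 2;                            # odd positions hold the tags
        $parts[$i] =~ s/&(?!#?\w+;)/&amp;/g;       # keep existing entities
        $parts[$i] =~ s/</&lt;/g;
        $parts[$i] =~ s/>/&gt;/g;
    }
    return join '', @parts;
}

print escape_between_tags(
    "<Root> <Node1>Data with < in it</Node1> <Node2>Data with > in it</Node2> </Root>\n"
);
```

The output should still be run through a real parser afterwards, since a regex has no way to tell a stray <word> in the data from a genuine tag.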

          For entity conversion (to or from) I like HTML::Entities. XML::Writer is nice in other respects, but its entity conversion could be better. It only does the three above, it always does them named, and it's not smart enough to realize when there are & characters introducing entities that are already in the data.

    Re: Escaping XML Reserved characters
    by leriksen (Curate) on Sep 16, 2003 at 01:46 UTC
      It's already been touched on by previous comments, but here's the dance I do on my system where I do not have permission to install Perl modules in the default area.

      tar zxvf Module-0.0.tar.gz
      cd Module-0.0
      perl Makefile.PL PREFIX=/path/to/private/install/spot
      make && make test && make install

      Note that under Perl 5.8 this works great, but with versions before 5.8 you may also need to define INSTALLMAN1DIR and INSTALLMAN3DIR to get the docs to install correctly.
      Now make sure you set PERL5LIB in your .localrc or .bashrc, or wherever your login stuff goes.