lilalfyalien has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have the following scenario: I have some XML like this (pseudo, not real data):

    <?xml version="1.0" encoding="UTF-8"?>
    <definition>
      <property name="irrelevant"></property>
    </definition>
    <definition>
      <property name="youaretheoneiwant">
        <![CDATA[
        <!doctype html>
        <html xmlns="http://www.w3.org/1999/xhtml">
        <head>
          <title>Some UTF-8 characters hereí</title>
        </head>
        <body>
          <div>Some more UTF-8 characters hereč</div>
          <div><span>±There could be UTF-8 characters anywhere</span></div>
        </body></html>
        ]]>
      </property>
      <property name="idontcareaboutyou">
      </property>
    </definition>

I need to be able to replace all the non-ASCII (UTF-8) characters with their HTML entity codes. I only have access to a finite set of Perl modules and cannot install any more, as this needs to be run on servers that I have no admin rights to.

So far I have used XML::Twig to get to the property value in youaretheoneiwant, but when I run:
encode_entities( $youaretheoneiwantValue , '&\'"[]\200-\377' );
It encodes all of my HTML tags too, even though I thought I'd told it to ignore <> characters.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use utf8;
    use XML::Twig;
    use HTML::Entities;
    use HTML::Parser;

    my $xml = $ARGV[0] or die "Usage: format_html_nicely.pl XML_DATA\n";
    #print $xml;

    my $twig = XML::Twig->new(
        pretty_print  => 'indented',
        twig_handlers => { property => \&encodeCorrectly },
    );
    $twig->parse( $xml );
    $twig->flush;
    exit;

    sub encodeCorrectly {
        my( $twig, $property ) = @_;
        if ( $property->att('name') eq 'youaretheoneiwant' ) {
            my $htmlToEncode = $property->text;
            my $htmlEncoded  = encode_entities( $htmlToEncode, '&\'"[]\200-\377' );
            #print "\n\n\n\n\n" . $htmlEncoded ."\n\n\n\n\n";
            $property->set_text( $htmlEncoded );
            #print "\n\n\n\n\n" . $property->text ."\n\n\n\n\n";
            $twig->flush;
        }
    }

I'm not convinced I'm taking the right approach? Can anyone offer any advice? Thanks!

Re: HTML encoding UTF-8 characters in an HTML block
by Anonymous Monk on Dec 22, 2014 at 17:15 UTC
    It encodes all of my HTML tags too, even though I thought I'd told it to ignore <> characters.
    Are you saying encode_entities encodes < and > given the string '&\'"[]\200-\377'? I'm looking at its source and don't see how that's possible... OTOH, it will encode " (quote) in "http://www.w3.org/1999/xhtml", which will break the HTML. It seems you'll have to parse the HTML too. And maybe you'll have to decode the text that XML::Twig returns (using Encode), and encode it back again... (I don't know how Twig works).
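    A quick way to check this claim is to run encode_entities on a small string with the same character set, outside of XML::Twig. This is only a sketch: the sample string is made up for illustration, and it assumes HTML::Entities is among the modules available on the server.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use HTML::Entities qw(encode_entities);

# Hypothetical sample fragment, not the poster's real data.
my $html = '<html xmlns="http://www.w3.org/1999/xhtml"><title>hereí</title></html>';

# Same "unsafe characters" set as in the question.
my $encoded = encode_entities( $html, '&\'"[]\200-\377' );

# The angle brackets should come back untouched, while the double
# quotes around the xmlns value are turned into entities.
binmode STDOUT, ':encoding(UTF-8)';
print "$encoded\n";
```

    If the tags survive here but not in the full script, the damage is happening somewhere other than in this call.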
Re: HTML encoding UTF-8 characters in an HTML block
by graff (Chancellor) on Dec 23, 2014 at 07:28 UTC
    I fixed the xml sample you posted (because what you posted was not valid xml), and I made a couple adjustments to your script. Here's the fixed xml:
        <?xml version="1.0" encoding="UTF-8"?>
        <foo>
        <definition>
          <property name="irrelevant"></property>
        </definition>
        <definition>
          <property name="youaretheoneiwant">
            <![CDATA[
            <!doctype html>
            <html xmlns="http://www.w3.org/1999/xhtml">
            <head>
              <title>Some UTF-8 characters hereí</title>
            </head>
            <body>
              <div>Some more UTF-8 characters hereč</div>
              <div><span>±There could be UTF-8 characters anywhere</span></div>
            </body></html>
            ]]>
          </property>
          <property name="idontcareaboutyou">
          </property>
        </definition>
        </foo>

    Note the addition of a root element around the two "definition" elements. And here's the fixed script:
        #!/usr/bin/perl
        use strict;
        use warnings;
        use utf8;
        use XML::Twig;
        use HTML::Entities;
        use HTML::Parser;

        my $xml = $ARGV[0] or die "Usage: $0 file.xml\n";
        #print $xml;

        my $twig = XML::Twig->new(
            pretty_print  => 'indented',
            twig_handlers => { '#CDATA' => \&encodeCorrectly },
        );
        $twig->parsefile( $xml );
        $twig->flush;
        exit;

        sub encodeCorrectly {
            my( $twig, $property ) = @_;
            my $htmlToEncode = $property->text;
            my $htmlEncoded  = encode_entities( $htmlToEncode, '&\'"[]\200-\377' );
            # print "\n\n\n" . $htmlEncoded ."\n\n\n";
            $property->set_text( $htmlEncoded );
            # print "\n\n\n" . $property->text ."\n\n\n";
        }

    Note that I'm invoking the "encodeCorrectly" handler on #CDATA, rather than on the "property" element that contains the CDATA. (Also, I chose to put the xml data into a file, and provide the file name as a command-line arg, so I'm using "parsefile" instead of just "parse".)

    (updated the script to fix the "Usage" text, because I'm compulsive about that.)

    For reasons that I don't fully understand, there's something special about the way CDATA is being used in your xml snippet - and how that gets "flushed" - so it was XML::Twig's "flush" operation (not your re-encoding function) that was causing the trouble.

    I don't know whether that fix will work for your intended/actual data, but it seems to do what you want for the posted/fixed sample data.

    BTW, if it's safe to assume that non-ASCII characters never show up as part of the markup (names of tags or attributes) in your xml data, and if it would be sufficient to use numeric character entity references rather than symbolic ones, you could just re-code your data like this:

        perl -CS -pe 's/([^[:ascii:]])/sprintf("&#%d;",ord($1))/eg' < orig.xml > encoded.xml

    That just replaces every non-ASCII character with the corresponding &#(numeric); entity.
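    The same substitution can be sketched as a small function that uses only core Perl, which may be easier to drop into an existing script. The function name to_numeric_entities and the sample string are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;

# Replace each non-ASCII character with a decimal numeric
# character reference such as &#269; (markup is left untouched
# as long as it is pure ASCII).
sub to_numeric_entities {
    my ($text) = @_;
    $text =~ s/([^[:ascii:]])/sprintf('&#%d;', ord($1))/eg;
    return $text;
}

# Hypothetical sample; with "use utf8" the literal is a character string.
my $sample  = '<div>hereč ±</div>';
my $encoded = to_numeric_entities($sample);

print "$encoded\n";    # prints <div>here&#269; &#177;</div>
```

    Note that this works on decoded character strings, so input read from a file would need to be opened with a UTF-8 layer (or decoded) first; the -CS switch does that for STDIN and STDOUT in the one-liner.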

    ANOTHER UPDATE: Actually, it's not so mysterious why the CDATA part of your xml snippet was causing trouble (and actually, it was your encoding function that caused the trouble): in order for the CDATA thing to work as intended, the square brackets need to be left as-is. If you want to try your original code again, just leave the square brackets out of the "encode_entities" call. Also, please see the HTML::Entities docs for handling unicode characters above \377 (U+00FF), in case any such happen to occur in your data. (It's not unusual to run into something like U+2019 used as an apostrophe or single-quote.) Note that the one-liner above handles all code points outside the ASCII range.
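    One way to cover code points above U+00FF with encode_entities is a negated character set, in the style of the example in the HTML::Entities docs. The following is a sketch under the assumption that all markup is plain ASCII; the sample string is invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use HTML::Entities qw(encode_entities);

# Hypothetical sample containing U+2019 (right single quote),
# U+010D (c with caron) and U+00B1 (plus-minus sign).
my $html = '<div><span>±It’s hereč</span></div>';

# A leading ^ negates the set, so this encodes every character that
# is NOT in the ASCII range. Tags, quotes, ampersands and square
# brackets are all ASCII and are therefore left as they are.
my $encoded = encode_entities( $html, '^\x00-\x7F' );

print "$encoded\n";
```

    Because ampersands are left alone, this is only appropriate when the HTML inside the CDATA section is already well-formed markup rather than raw text.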