Hi, I have the following scenario: I have some XML like this (pseudo, not real data):
<?xml version="1.0" encoding="UTF-8"?>
<definition>
  <property name="irrelevant"></property>
</definition>
<definition>
  <property name="youaretheoneiwant">
    <![CDATA[
    <!doctype html>
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
      <title>Some UTF-8 characters hereí</title>
    </head>
    <body>
      <div>Some more UTF-8 characters hereč</div>
      <div><span>±There could be UTF-8 characters anywhere</span></div>
    </body></html>
    ]]>
  </property>
  <property name="idontcareaboutyou">
  </property>
</definition>
I need to be able to replace all the non-ASCII UTF-8 characters with their HTML entity codes. I only have access to a finite set of Perl modules and cannot install any more, as this needs to run on servers that I have no admin rights to.

So far I have used XML::Twig to get to the property value in youaretheoneiwant, but when I run:
encode_entities( $youaretheoneiwantValue , '&\'"[]\200-\377' );
It encodes all of my HTML tags too, even though I thought I'd told it to ignore the < and > characters. Here is the full script:
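A note on that character class, offered as a guess rather than a diagnosis: it does not list < or >, but it does list the double quote and &, so attribute quotes inside the HTML get rewritten, and \200-\377 stops at U+00FF, so a character such as č (U+010D) falls outside it. As far as I know, XML::Twig also escapes < and & by itself when it writes ordinary text back out, which could account for the tags. Below is a minimal core-Perl sketch that touches only non-ASCII characters. The helper name encode_non_ascii is made up for illustration (it is not an HTML::Entities function), and the sketch assumes the string has already been decoded into characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Entities): replace every character
# above U+007F with a hexadecimal numeric character reference and leave all
# ASCII, including < > & " and ', exactly as it is.
sub encode_non_ascii {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf('&#x%X;', ord $1)/ge;
    return $text;
}

# Assumption: the string holds decoded characters, not raw UTF-8 bytes.
my $html    = qq{<div class="x">Some more characters here\x{10D}</div>};
my $encoded = encode_non_ascii($html);

print "$encoded\n";
# prints: <div class="x">Some more characters here&#x10D;</div>
```

This emits numeric references only (&#x10D; rather than a named entity such as &iacute;), which trades readability for not needing any module beyond core Perl.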
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use XML::Twig;
use HTML::Entities;
use HTML::Parser;

# The XML document is passed as the first command-line argument.
my $xml = $ARGV[0] or die "Usage: format_html_nicely.pl XML_DATA\n";
#print $xml;

my $twig = XML::Twig->new(
    pretty_print  => 'indented',
    twig_handlers => { property => \&encodeCorrectly },
);
$twig->parse( $xml );
$twig->flush;
exit;

# Handler called for every <property> element.
sub encodeCorrectly {
    my( $twig, $property ) = @_;
    if( $property->att('name') eq 'youaretheoneiwant' ) {
        my $htmlToEncode = $property->text;
        # The "=" was missing here in the original post.
        my $htmlEncoded = encode_entities( $htmlToEncode, '&\'"[]\200-\377' );
        #print "\n\n\n\n\n" . $htmlEncoded . "\n\n\n\n\n";
        $property->set_text( $htmlEncoded );
        #print "\n\n\n\n\n" . $property->text . "\n\n\n\n\n";
        $twig->flush;
    }
}
I'm not convinced I'm taking the right approach. Can anyone offer any advice? Thanks!
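One more thing that may matter with this approach: use utf8 only affects literals in the script source, so a string that arrives through @ARGV is still raw bytes until it is decoded. A character-based substitution then sees the two bytes of č separately instead of one character. Here is a small sketch with the core Encode module; the helper name decode_utf8_arg is hypothetical, and the sketch assumes the argument really is UTF-8.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn raw UTF-8 bytes (as they arrive in @ARGV)
# into a Perl character string.
sub decode_utf8_arg {
    my ($bytes) = @_;
    return decode('UTF-8', $bytes);
}

# The two UTF-8 bytes that encode U+010D, standing in for command-line input.
my $raw  = "\xC4\x8D";
my $text = decode_utf8_arg($raw);

printf "bytes: %d, characters: %d, code point: U+%04X\n",
    length($raw), length($text), ord($text);
# prints: bytes: 2, characters: 1, code point: U+010D
```

In a script like the one above, the decode step would presumably go right after reading $ARGV[0], before any entity encoding is applied to the text.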

In reply to HTML encoding UTF-8 characters in an HTML block by lilalfyalien
