lilalfyalien has asked for the wisdom of the Perl Monks concerning the following question:
I need to be able to replace all the UTF-8 characters with their HTML entity codes. I only have access to a finite set of Perl modules and cannot install any more, as this needs to be run on servers that I have no admin rights to.

Sample input:

    <?xml version="1.0" encoding="UTF-8"?>
    <definition>
      <property name="irrelevant"></property>
    </definition>
    <definition>
      <property name="youaretheoneiwant">
        <![CDATA[
        <!doctype html>
        <html xmlns="http://www.w3.org/1999/xhtml">
        <head>
          <title>Some UTF-8 characters hereí</title>
        </head>
        <body>
          <div>Some more UTF-8 characters hereč</div>
          <div><span>±There could be UTF-8 characters anywhere</span></div>
        </body></html>
        ]]>
      </property>
      <property name="idontcareaboutyou">
      </property>
    </definition>

So far I have used XML::Twig to get to the property value in youaretheoneiwant, but when I run:

    encode_entities( $youaretheoneiwantValue , '&\'"[]\200-\377' );

it encodes all of my HTML tags too, even though I thought I'd told it to ignore the <> characters.

I'm not convinced I'm taking the right approach. Can anyone offer any advice? Thanks!

    #!/usr/bin/perl
    use strict;
    use warnings;
    use utf8;
    use XML::Twig;
    use HTML::Entities;
    use HTML::Parser;

    my $xml = $ARGV[0] or die "Usage: format_html_nicely.pl XML_DATA\n";
    #print $xml;

    # Call encodeCorrectly for every <property> element as it is parsed.
    my $twig = XML::Twig->new(
        pretty_print  => 'indented',
        twig_handlers => { property => \&encodeCorrectly },
    );
    $twig->parse( $xml );
    $twig->flush;
    exit;

    sub encodeCorrectly {
        my( $twig, $property ) = @_;
        if ( $property->att('name') eq 'youaretheoneiwant' ) {
            my $htmlToEncode = $property->text;
            my $htmlEncoded  = encode_entities( $htmlToEncode , '&\'"[]\200-\377' );
            #print "\n\n\n\n\n" . $htmlEncoded ."\n\n\n\n\n";
            $property->set_text( $htmlEncoded );
            #print "\n\n\n\n\n" . $property->text ."\n\n\n\n\n";
            $twig->flush;
        }
    }
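For illustration, here is a minimal sketch of one way to encode only the non-ASCII characters while leaving `<`, `>` and `&` alone, using core Perl only. It is not the method from the post: the helper name `encode_non_ascii` is hypothetical, it emits decimal numeric character references rather than named entities, and it assumes the string has already been decoded into Perl characters (not raw UTF-8 bytes).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): replace every
# character outside the ASCII range with a decimal numeric character
# reference. Markup characters such as < > & are left untouched.
# Assumption: $html is a decoded character string, not raw UTF-8 bytes.
sub encode_non_ascii {
    my ($html) = @_;
    $html =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord $1)/ge;
    return $html;
}

# \x{10D} is "č" and \x{B1} is "±", as in the sample input.
my $sample = "<div>here\x{10D}</div><span>\x{B1}There</span>";
print encode_non_ascii($sample), "\n";
# prints: <div>here&#269;</div><span>&#177;There</span>
```

Because the substitution matches on code points rather than a fixed `\200-\377` range, it also covers characters above U+00FF such as "č". Whether the tags survive in the final XML still depends on how the value is written back into the document, which this sketch does not address.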
Replies are listed 'Best First'.

Re: HTML encoding UTF-8 characters in an HTML block
by Anonymous Monk on Dec 22, 2014 at 17:15 UTC

Re: HTML encoding UTF-8 characters in an HTML block
by graff (Chancellor) on Dec 23, 2014 at 07:28 UTC