Hexadecimal Entity Problem in XML::Twig

Samy_rio has asked for the wisdom of the Perl Monks concerning the following question:

Greetings monks, I have a problem using XML::Twig. The code I tried is:

use strict;
use XML::Twig;
my $file = $ARGV[0];

my $first = 'paragraph';
my $replace = 'p';

my $twig = XML::Twig->new(twig_handlers => 
{"$first" => sub {$_->set_gi("$replace")}},
);
$twig->parsefile($file);
$twig->print;

It works fine, but it converts all hexadecimal entities into symbols. I need the output text to be the same as the input.

Input:

<paragraph>Soci&#x00E9;t&#x00E9; Nationale des Chemins de Fer Fran&#x00E7;ais, the Spain (Telef&#x00F3;nica)</paragraph>

Current Output:

<p>SociÃ©tÃ© Nationale des Chemins de Fer FranÃ§ais, the Spain (TelefÃ³nica)</p>

Expected Output:

<p>Soci&#x00E9;t&#x00E9; Nationale des Chemins de Fer Fran&#x00E7;ais, the Spain (Telef&#x00F3;nica)</p>

How can I avoid this?

Regards,
Velusamy R.


Re: Hexadecimal Entity Problem in XML::Twig by Skeeve (Parson) on Oct 27, 2005 at 14:21 UTC
perldoc XML::Twig says, under `keep_encoding`: "This is a (slightly?) evil option: if the XML document is not UTF-8 encoded and you want to keep it that way, then setting keep_encoding will use the "Expat" original_string method for character, thus keeping the original encoding, as well as the original entities in the strings."
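
A minimal sketch of how the `keep_encoding` option might be applied to the code from the question. The inline sample string is an assumption made for illustration (the original script reads a file with `parsefile`), and the result is captured with `sprint` only so that it can be inspected.

```perl
use strict;
use warnings;
use XML::Twig;

# Shortened sample taken from the question; real code would call parsefile().
my $xml = '<paragraph>Soci&#x00E9;t&#x00E9; (Telef&#x00F3;nica)</paragraph>';

my $twig = XML::Twig->new(
    keep_encoding => 1,    # reuse the original source text, entities included
    twig_handlers => {
        # Rename <paragraph> to <p>, as in the question.
        paragraph => sub { $_->set_gi('p') },
    },
);
$twig->parse($xml);

# sprint returns the document as a string instead of printing it.
my $out = $twig->sprint;
print "$out\n";
```

Because the text is passed through untouched with this option, any text you add or modify yourself also has to be valid in the document's original encoding.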
Re: Hexadecimal Entity Problem in XML::Twig by mirod (Canon) on Oct 27, 2005 at 14:54 UTC
If you can live with `&#xe9;` instead of `&#x00E9;`, you can use the `output_filter => "safe_hex"` option when you create the twig. Otherwise you can create your own filter based on the code for the `safe_hex` option (basically a call to `encode( ascii => $str, $FB_XMLCREF)`). `keep_encoding` should only be used if you really want to modify the original document as little as possible.	
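
The two filter options might look roughly like this. It is a sketch, not the module's own code: `pad_hex_filter` and `convert` are hypothetical helper names, the sample string is shortened from the question, and it assumes that `output_filter` accepts a code reference which receives the output as a character string and returns the filtered string.

```perl
use strict;
use warnings;
use XML::Twig;

my $xml = '<paragraph>Soci&#x00E9;t&#x00E9; (Telef&#x00F3;nica)</paragraph>';

# Hypothetical custom filter: escape every non-ASCII character as a
# zero-padded, upper-case hexadecimal reference such as &#x00E9;.
sub pad_hex_filter {
    my ($str) = @_;
    $str =~ s/([^\x00-\x7F])/sprintf('&#x%04X;', ord $1)/ge;
    return $str;
}

# Hypothetical helper: parse the sample with the given output filter
# and rename <paragraph> to <p>.
sub convert {
    my ($filter) = @_;
    my $twig = XML::Twig->new(
        output_filter => $filter,
        twig_handlers => { paragraph => sub { $_->set_gi('p') } },
    );
    $twig->parse($xml);
    return $twig->sprint;
}

my $builtin = convert('safe_hex');          # predefined filter, shortest hex form
my $custom  = convert(\&pad_hex_filter);    # padded form, as in the input

print "$builtin\n$custom\n";
```

The custom filter only reproduces the input exactly if every reference in the source was written as four upper-case hex digits; references written in another style would be normalized to that form.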
Re: Hexadecimal Entity Problem in XML::Twig by graff (Chancellor) on Oct 28, 2005 at 01:05 UTC
So, in the "current output" as posted, your data is being converted to utf-8 encoding. If the "keep_encoding" trick mentioned above has anything wrong with it, you could just post-process your data to convert "wide" utf8 characters back to the original hex-numeric code point notation: `$output =~ s/(\P{IsASCII})/sprintf("&#x%04X;", ord($1))/ge;` That replaces every non-ASCII character with its hex-unicode entity notation, provided `$output` holds decoded characters rather than raw bytes. See "perldoc perlre" about the "\p" and "\P" constructs and character classes.	
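
The post-processing idea can be tried on its own with core Perl only. In this sketch `escape_non_ascii` is a hypothetical helper name, and the byte string stands in for UTF-8 output like the "current output" in the question.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn each non-ASCII character into &#xHHHH;
# (upper-case hex, zero-padded to at least four digits).
sub escape_non_ascii {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf('&#x%04X;', ord $1)/ge;
    return $text;
}

# UTF-8 encoded bytes. Decode them to characters first, otherwise each
# byte of a multi-byte sequence would be escaped separately.
my $bytes  = "<p>Soci\xC3\xA9t\xC3\xA9 (Telef\xC3\xB3nica)</p>";
my $output = escape_non_ascii(decode('UTF-8', $bytes));

print "$output\n";    # <p>Soci&#x00E9;t&#x00E9; (Telef&#x00F3;nica)</p>
```

The `e` modifier makes Perl evaluate the replacement as code, and `ord` supplies the numeric code point that the `%04X` format needs.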