Samy_rio has asked for the wisdom of the Perl Monks concerning the following question:

Greetings monks, I have a problem using XML::Twig. The code I tried is:

use strict;
use XML::Twig;

my $file    = $ARGV[0];
my $first   = 'paragraph';
my $replace = 'p';

my $twig = XML::Twig->new(
    twig_handlers => { "$first" => sub { $_->set_gi("$replace") } },
);
$twig->parsefile($file);
$twig->print;

It works fine, but it converts all hexadecimal entities into the characters they represent. I need the text to stay the same as in the input.

Input:
<paragraph>Soci&#x00E9;t&#x00E9; Nationale des Chemins de Fer Fran&#x00E7;ais, the Spain (Telef&#x00F3;nica)</paragraph>

Current Output:
<p>Société Nationale des Chemins de Fer Français, the Spain (Telefónica)</p>

Expected Output:
<p>Soci&#x00E9;t&#x00E9; Nationale des Chemins de Fer Fran&#x00E7;ais, the Spain (Telef&#x00F3;nica)</p>

How can I avoid this?

Regards,
Velusamy R.

Replies are listed 'Best First'.
Re: Hexadecimal Entity Problem in XML::Twig
by Skeeve (Parson) on Oct 27, 2005 at 14:21 UTC
    perldoc XML::Twig says:
    keep_encoding
    This is a (slightly?) evil option: if the XML document is not UTF-8 encoded and you want to keep it that way, then setting keep_encoding will use the "Expat" original_string method for character, thus keeping the original encoding, as well as the original entities in the strings.
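
A minimal sketch of how that option could be applied to the original script. The helper name rename_keeping_entities is hypothetical, and the sketch assumes XML::Twig is installed and that keep_encoding behaves as the quoted documentation describes.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: rename <paragraph> to <p> while asking XML::Twig
# to keep the character data as it was written in the source.
sub rename_keeping_entities {
    my ($xml) = @_;
    my $twig = XML::Twig->new(
        keep_encoding => 1,    # keep original encoding and entities
        twig_handlers => { paragraph => sub { $_->set_gi('p') } },
    );
    $twig->parse($xml);        # parse a string; parsefile() works for files
    return $twig->sprint;      # return the document instead of printing it
}

print rename_keeping_entities(
    '<paragraph>Soci&#x00E9;t&#x00E9; (Telef&#x00F3;nica)</paragraph>'
), "\n";
```

With keep_encoding set, the text in the twig is the raw source string rather than decoded characters, so any handler that inspects or edits the text sees the entities, not the letters.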

    s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
    +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
Re: Hexadecimal Entity Problem in XML::Twig
by mirod (Canon) on Oct 27, 2005 at 14:54 UTC

    If you can live with &#xe9; instead of &#x00E9; you can use the output_filter => "safe_hex" option when you create the twig. Otherwise you can create your own filter based on the code for the safe_hex option (basically a call to encode( ascii => $str, $FB_XMLCREF)).

    keep_encoding should only be used if you really want to modify the original document as little as possible.
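
A rough sketch of what such a filter could look like, using only the core Encode module. The name hex_entities is hypothetical, and Encode::FB_XMLCREF is used here as the public constant that presumably corresponds to the $FB_XMLCREF mentioned above.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical filter: encode to ASCII and let Encode replace every
# character that ASCII cannot represent with a hexadecimal reference.
sub hex_entities {
    my ($str) = @_;    # local copy, since encode() may modify its argument
    return Encode::encode( 'ascii', $str, Encode::FB_XMLCREF );
}

# "Soci\x{E9}t\x{E9}" is "Société" as a decoded character string.
print hex_entities("Soci\x{E9}t\x{E9}"), "\n";
```

Since output_filter is documented as accepting a subroutine reference as well as a filter name, a sub like this could presumably be passed as output_filter => \&hex_entities when creating the twig. The references it produces are not zero-padded, so they will not match the &#x00E9; form of the input exactly.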

Re: Hexadecimal Entity Problem in XML::Twig
by graff (Chancellor) on Oct 28, 2005 at 01:05 UTC
    So, in the "current output" as posted, your data is being converted to utf-8 encoding. If the "keep_encoding" option mentioned above causes problems for you, you could just post-process your data to convert "wide" utf8 characters back to the original hex-numeric code point notation:
    $output =~ s/(\P{IsASCII})/sprintf("&#x%04X;", ord($1))/ge;
    That replaces every non-ASCII character with its hex-unicode entity notation. See "perldoc perlre" about the "\p" and "\P" constructs and character classes.
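
A self-contained sketch of that post-processing step, assuming the output arrives as UTF-8 encoded bytes that must be decoded into characters before the substitution. The helper name utf8_to_hex_entities is hypothetical, and the zero-padded uppercase format is chosen to match the entities in the original input.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: turn UTF-8 bytes into ASCII text in which every
# non-ASCII character is written as a four-digit hexadecimal reference.
sub utf8_to_hex_entities {
    my ($bytes) = @_;
    my $text = Encode::decode( 'UTF-8', $bytes );    # bytes -> characters
    # ord() supplies the code point; /e evaluates sprintf for each match.
    $text =~ s/([^\x00-\x7F])/sprintf('&#x%04X;', ord $1)/ge;
    return $text;
}

# "\xC3\xA7" and "\xC3\xB3" are the UTF-8 byte sequences for ç and ó.
my $output = "<p>Fran\xC3\xA7ais (Telef\xC3\xB3nica)</p>";
print utf8_to_hex_entities($output), "\n";
# prints: <p>Fran&#x00E7;ais (Telef&#x00F3;nica)</p>
```

The character class [^\x00-\x7F] matches the same characters as \P{IsASCII}. Note that this rewrites every non-ASCII character, including any that were literal characters rather than entities in the source document.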