http://qs1969.pair.com?node_id=817319

gam3 has asked for the wisdom of the Perl Monks concerning the following question:

Is there a way to keep XML::Parser from converting numeric entities into UTF8?

Or is there some other parser that will let me do this?

use strict;
use XML::Parser;
use vars qw($parser);

sub handle_start {
    my $self = shift;
    my $x = shift;
    print "<" . $x . '>';
}

sub handle_end {
    my $self = shift;
    my $x = shift;
    print "</" . $x . '>';
}

sub handle_char {
    my $self = shift;
    my $x = shift;
    print $x;
}

$parser = XML::Parser->new(
    Handlers => {
        Start => \&handle_start,
        End   => \&handle_end,
        Char  => \&handle_char
    }
);

$parser->parse(<<XML);
<start>&#8211;</start>
XML

I would like this program to output
<start>&#8211;</start>
not
<start>–</start>
-- gam3
A picture is worth a thousand words, but takes 200K.

Re: XML::Parser and numeric entities
by ikegami (Patriarch) on Jan 14, 2010 at 02:53 UTC

    It simply decodes the entities. It doesn't then encode the character using UTF-8.

    If you want all non-ASCII characters encoded as numeric entities, you can re-encode the text in the Char handler:

    use HTML::Entities qw( encode_entities_numeric );

    sub handle_char {
        my $self = shift;
        my $x = shift;
        print encode_entities_numeric($x);
    }
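
    One detail worth checking: encode_entities_numeric emits hexadecimal references (&#x2013;), not the decimal form (&#8211;) asked for in the question. If the decimal form matters, a small substitution can do the re-encoding. This is only a sketch; encode_decimal is a hypothetical helper name, not part of HTML::Entities, and it escapes only &, <, > and non-ASCII characters.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper (not from HTML::Entities): escape the XML
    # markup characters, then re-encode every non-ASCII character as a
    # decimal numeric character reference.
    sub encode_decimal {
        my ($text) = @_;
        $text =~ s/&/&amp;/g;    # must run first so later output is not re-escaped
        $text =~ s/</&lt;/g;
        $text =~ s/>/&gt;/g;
        $text =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
        return $text;
    }

    # U+2013 (EN DASH) is code point 8211 in decimal.
    print encode_decimal("\x{2013}"), "\n";
    ```

    Calling encode_decimal($x) in handle_char in place of encode_entities_numeric($x) would then print decimal references for any non-ASCII character, whether or not it was written as an entity in the input.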

    There's also a handler you can use instead of Char that receives the entities still encoded, but then you're not guaranteed to have all non-ASCII characters encoded.
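
    The reply does not name that handler. One candidate is XML::Parser's Default handler, which is called for anything in the document that has no registered handler of its own. The sketch below assumes that, with no Char handler registered, character references reach Default as they were written; $out is a variable introduced here only to collect the output.

    ```perl
    use strict;
    use warnings;
    use XML::Parser;

    my $out = '';    # collects the output (illustrative name)

    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub { my ($expat, $tag) = @_; $out .= "<$tag>" },
            End   => sub { my ($expat, $tag) = @_; $out .= "</$tag>" },
            # No Char handler is registered, so character data, including
            # numeric character references, falls through to Default.
            Default => sub { my ($expat, $string) = @_; $out .= $string },
        },
    );

    $parser->parse('<start>&#8211;</start>');
    print "$out\n";
    ```

    As the reply warns, this passes text through as written: a literal non-ASCII character in the input stays a literal character and is not turned into an entity.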

      Thank you for that information. I can use it to patch up my problem.

      However, what I really want is for XML::Parser to NOT decode the numeric entities at all.

      -- gam3
      A picture is worth a thousand words, but takes 200K.