
This is partly a cool use of Perl and partly a question.

It shows how to convert data from UTF-8 to Latin-1 (and would be very easy to adapt to other encodings). This matters when using XML::Parser (and in fact nearly all Perl XML modules), as it returns UTF-8 no matter what the encoding of the initial file is.

It gives you the choice of 3 methods:

  • a regexp (lifted from XML::TiePYX), which obviously works only for conversion to latin1,
  • using the Unicode::String (and Unicode::Map8) modules (lifted from somewhere here or on the perl-xml mailing list, I can't remember),
  • using the Text::Iconv module (which needs the iconv library to be available on your machine) which I actually managed to figure out how to use myself, straight from the docs ;--)
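To make the arithmetic behind the regexp method concrete, here is a minimal sketch of the same substitution applied to a raw byte string. The sub name `utf8_to_latin1` and the sample string are only illustrative; the substitution itself is the one used in the script. It assumes the input is a plain byte string containing only characters in the U+0000..U+00FF range, which is why it cannot be adapted to other target encodings.

```perl
use strict;
use warnings;

# Fold each two-byte UTF-8 sequence whose lead byte is 0xC0-0xC3
# into the single Latin-1 byte it encodes.
sub utf8_to_latin1 {
    my $text = shift;
    $text =~ s{([\xc0-\xc3])(.)}{
        # low 2 bits of the lead byte + low 6 bits of the continuation byte
        chr( ((ord($1) & 0x03) << 6) | (ord($2) & 0x3F) )
    }ge;
    return $text;
}

# "é" as raw UTF-8 bytes is 0xC3 0xA9; in Latin-1 it is the single byte 0xE9
my $utf8   = "caf\xc3\xa9";
my $latin1 = utf8_to_latin1($utf8);

# show both strings as hex so the byte-level change is visible
printf "%s -> %s\n", unpack( 'H*', $utf8 ), unpack( 'H*', $latin1 );
# prints: 636166c3a9 -> 636166e9
```

For "é": (0xC3 & 0x03) << 6 gives 0xC0, 0xA9 & 0x3F gives 0x29, and the two combined give 0xE9.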

Now here is my problem: using Perl 5.6.1, the regexp solution works fine with XML::Parser 2.27 but not with version 2.30 (the tag and attribute names are not converted). I have had various problems with converting encodings recently, be it with XML::TiePYX or XML::Parser, and as I am including such filters in XML::Twig I am wondering if anybody has any idea what is going on. Could you also test this script with various combinations of OS and, most importantly, of Perl and XML::Parser versions, just to get an idea of the magnitude of the problem?

Oh, and if anybody has any idea of how to solve this problem that would be very cool of course! Plus I'll take any advice on how to improve this code.

The way I create the filter function with Unicode::String and Text::Iconv is a little convoluted, but I needed to do it this way in XML::Twig so I thought I'd leave it as-is just to show how you can pass an extra function reference to XML::Parser::Expat. It would be very easy to simplify and just call a regular subroutine instead.
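As a rough idea of what the simplified version could look like, here is a sketch that builds the filter with an ordinary closure instead of a string eval. `make_filter` and the `$build` argument are hypothetical names, not part of XML::Parser or XML::Twig, and the regexp conversion stands in for a real converter object (Text::Iconv, Unicode::Map8) so that the example runs without extra modules.

```perl
use strict;
use warnings;

# Stand-in converter: the regexp conversion used by the -r option.
sub latin1 {
    my $text = shift;
    $text =~ s{([\xc0-\xc3])(.)}{
        chr( ((ord($1) & 0x03) << 6) | (ord($2) & 0x3F) )
    }ge;
    return $text;
}

# Hypothetical factory: $build is a code ref that returns a conversion sub.
# The converter is created on the first call and reused afterwards.
sub make_filter {
    my ($build) = @_;
    my $cnv;    # kept alive by the closure between calls
    return sub {
        $cnv ||= $build->() or die "Can't create converter";
        return $cnv->( $_[0] );
    };
}

my $builds = 0;    # counts how often the converter is constructed
my $filter = make_filter( sub { $builds++; return \&latin1 } );

$filter->("t\xc3\xa9") for 1 .. 3;
print "converter built $builds time(s)\n";
# prints: converter built 1 time(s)
```

The returned code reference can be handed to the parser exactly as `$filter` is in the script.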

#!/bin/perl -w

# converts XML data from UTF-8 back into latin1
#   -r uses a regexp
#   -u uses Unicode::String
#   -i uses Text::Iconv (and the iconv library)
# Note: -r does not work properly with XML::Parser 2.30

use strict;
use XML::Parser;

print "perl $] - XML::Parser $XML::Parser::VERSION\n";

my $filter;
if( $ARGV[0] eq '-r')    { $filter= \&latin1; }
elsif( $ARGV[0] eq '-u') { $filter= unicode_convert( 'latin1'); }
elsif( $ARGV[0] eq '-i') { $filter= iconv_convert( 'latin1'); }
else                     { die "usage: $0 [-r|-u|-i]"; }

# I like to escape as few characters as possible
# but you might need to escape ' too (with &apos;)
my %ent=( '"' => '&quot;', '<' => '&lt;', '&' => '&amp;');

# the extra filter option ends up in the parser object
# that is passed to each handler
my $p = new XML::Parser( Handlers => { Start   => \&start,
                                       End     => \&end,
                                       Default => \&default,
                                     },
                         filter => $filter,
                       );
$p->parse( \*DATA);
print "\n";

sub start
  { my( $p, $tag, %att)= @_;
    print '<', $p->{filter}->( $tag);
    while( my( $att, $val)= each %att)
      { print ' ', $p->{filter}->( $att), '="', $p->{filter}->( $val), '"'; }
    print '>';
  }

sub end
  { my( $p, $tag)= @_;
    print '</', $p->{filter}->( $tag), '>';
  }

sub default
  { my $p= shift;   # use the handler's own parser object, not the file-level $p
    print $p->{filter}->( $p->recognized_string());
  }

# shamelessly lifted from XML::TiePYX
sub latin1
  { my $text=shift;
    $text=~s{([\xc0-\xc3])(.)}{
        my $hi = ord($1);
        my $lo = ord($2);
        chr((($hi & 0x03) <<6) | ($lo & 0x3F))
      }ge;
    return $text;
  }

sub unicode_convert
  { my $enc= shift;
    require Unicode::Map8;
    require Unicode::String;
    import Unicode::String qw(utf8);
    my $sub= eval q{
               { my $cnv;
                 sub { $cnv ||= new Unicode::Map8 ($enc)
                         or die "Can't create converter";
                       return $cnv->to8 (utf8($_[0])->ucs2);
                     }
               }
             };
    return $sub;
  }

sub iconv_convert
  { my $enc= shift;
    require Text::Iconv;
    my $sub= eval q{
               { my $cnv;
                 sub { $cnv ||= new Text::Iconv( 'utf8', $enc)
                         or die "Can't create converter";
                       return $cnv->convert( $_[0]);
                     }
               }
             };
    return $sub;
  }

__DATA__
<?xml version="1.0" encoding="ISO-8859-1"?>
<docé té="val'ué">Un homme soupçonné d'être impliqué dans la mort d'un motard de la police,
renversé</docé>

In reply to Converting character encodings by mirod
