PerlMonks  

Latin-1 characters and XML

by kilinrax (Deacon)
on Apr 30, 2003 at 12:44 UTC ( #254261=perlquestion )

kilinrax has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to write a script to do the following:

  • Read in an .unl file of translations from one language to another
  • Read in a consistent-format XML file, and parse it
  • Apply the translations to some attributes and text elements in the file
  • Write the file back out, with all latin-1 characters escaped in numeric form (e.g. 'é' -> '&#233;')

Currently, I'm trying to do this with XML::Twig, but only the attributes are coming out escaped properly, not the text.

Read in translations and construct twig:
open( TRANSLATIONS, $translations_file )
    or die "Could not open '$translations_file': $!";
while( <TRANSLATIONS> ) {
    if( my ($word, $translation) = split /\|/ ) {
        push( @translations, {
            match       => qr/\b$word\b/i,
            replacement => $translation,
        } );
    }
}
close( TRANSLATIONS );

my $twig= new XML::Twig(
    twig_handlers => { 'FOO' => \&foo },
    pretty_print  => 'indented',
    output_filter => 'safe',
);
$twig->parsefile( $xml_file );
$twig->print;
Apply translations to parts of element 'foo':
sub foo {
    my ($t, $elt)= @_;

    # apply translations to the name, and text of all descendant nodes
    if( my $name = $elt->att('NAME') ) {
        $elt->set_att( 'NAME', apply_translations($name) );
    }
    my @descendants = $elt->descendants( 'BAR' );
    foreach my $descendant (@descendants) {
        if( my $text = $descendant->text ) {
            $descendant->set_text( apply_translations( $text ) );
        }
    }
}

sub apply_translations {
    my $text = shift;
    foreach (@translations) {
        $text =~ s/$_->{match}/$_->{replacement}/g;
    }
    return $text;
}
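The translation step above can be tried in isolation, without a file or a twig. The following is a minimal sketch under stated assumptions: the sample lines, the `load_translations` name, and the idea that each .unl line looks like `word|translation|` are illustrative and not taken from the real file. It also wraps each word in `\Q...\E` so that regex metacharacters in a source word are matched literally, which the posted code does not do.

```perl
use strict;
use warnings;

# Build a list of { match, replacement } pairs from pipe-delimited lines.
sub load_translations {
    my @lines = @_;
    my @translations;
    foreach my $line (@lines) {
        chomp $line;
        my ($word, $translation) = split /\|/, $line;
        next unless defined $word && length $word && defined $translation;
        push @translations, {
            match       => qr/\b\Q$word\E\b/i,   # \Q..\E: match the word literally
            replacement => $translation,
        };
    }
    return @translations;
}

# Apply every translation, in order, to a copy of the text.
sub apply_translations {
    my ($text, @translations) = @_;
    foreach my $t (@translations) {
        $text =~ s/$t->{match}/$t->{replacement}/g;
    }
    return $text;
}

# Hypothetical sample data; "\xE9" is the Latin-1 byte for e-acute.
my @translations = load_translations(
    "Duration|Dur\xE9e|\n",
    "Airport|A\xE9roport|\n",
);
my $result = apply_translations('Airport duration', @translations);
```

Because the match is case-insensitive, the lower-case "duration" in the sample string is replaced as well.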
Input
<FOO NAME="Duration">
    <BAR>Duration</BAR>
</FOO>
<FOO NAME="Airport">
    <BAR>Airport</BAR>
</FOO>
--->
Output
<FOO NAME="Dur&#233;e">
    <BAR>Durée</BAR>
</FOO>
<FOO NAME="A&#233;roport">
    <BAR>A&#40111;port</BAR>
</FOO>
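The stray `&#40111;` in the output is worth a second look. As a matter of arithmetic, 40111 is exactly the value obtained when the three Latin-1 bytes for 'é', 'r', 'o' (0xE9 0x72 0x6F) are combined as though they formed one three-byte UTF-8 sequence, without checking that the second and third bytes are valid continuation bytes. Whether that is what actually happened inside XML::Twig or the underlying parser is an assumption; the sketch below only demonstrates the arithmetic, and `lenient_utf8_3byte` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Combine three bytes the way a UTF-8 decoder combines a 3-byte sequence,
# deliberately skipping the check that bytes 2 and 3 look like 10xxxxxx.
sub lenient_utf8_3byte {
    my ($b1, $b2, $b3) = @_;
    return (($b1 & 0x0F) << 12) | (($b2 & 0x3F) << 6) | ($b3 & 0x3F);
}

# Latin-1 bytes from the middle of "A\xE9roport": e-acute, 'r', 'o'.
my $code_point = lenient_utf8_3byte(0xE9, ord('r'), ord('o'));
print "&#$code_point;\n";   # prints &#40111;
```

If this reading is right, the text content is being treated as UTF-8 at some point even though it holds Latin-1 bytes, which would also explain why "ro" disappears from "Aéroport".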

I've also tried using XML::SAX to do the same thing, but it barfs on the XML version/encoding line, complaining 'Only ASCII encoding allowed without perl 5.7.2 or higher. You tried: iso-8859-1'.
Unfortunately another application running on this machine (not mine, so don't ask me why) requires perl 5.6.0 or 5.6.1, so upgrading perl would be problematic at the very least.

I'm somewhat loath to start hacking on a new version using a different module without being sure its support for latin-1 is better than that of the ones I've tried so far.

Does anyone have any experience of doing anything like this, and have recommendations/advice they'd care to share?

Replies are listed 'Best First'.
Re: Latin-1 characters and XML
by dragonchild (Archbishop) on Apr 30, 2003 at 13:40 UTC
    I'm using XML::Parser with 5.005_3 and I've got Latin-1 and UTF-8 (for double-byte). (This is all in a production system.)

    ------
    We are the carpenters and bricklayers of the Information Age.

    Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

    Please remember that I'm crufty and crotchety. All opinions are purely mine and all code is untested, unless otherwise specified.

      And what do you use to output non-ASCII characters as XML-escaped numeric entities? (e.g. 'ô' -> '&#244;')
      Is there an XML::Parser method to do this (if there is one, it's completely undocumented afaict), or do you use a separate module?
        What reader method do I use to write XML-escaped entities?!? Think about that for a second. XML::Parser reads the XML. It doesn't write it.

        You want some XML writer, of which there are many. And, yes, they will work with Latin-1. Another option is to use something like Unicode::String.
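        If no writer module is at hand, the numeric escaping asked about above can be sketched with a single substitution. This is a minimal sketch, assuming the text is held as Latin-1 bytes (one byte per character), as it typically would be on Perl 5.6; `escape_non_ascii` is a hypothetical helper name, not part of XML::Parser, XML::Twig, or Unicode::String.

```perl
use strict;
use warnings;

# Replace every non-ASCII character with a decimal numeric character
# reference. For Latin-1 the byte value equals the Unicode code point,
# so ord() gives the right number directly.
sub escape_non_ascii {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $text;
}

print escape_non_ascii("A\xE9roport"), "\n";   # prints A&#233;roport
print escape_non_ascii("h\xF4tel"), "\n";      # prints h&#244;tel
```

        The helper leaves `&`, `<`, and `>` alone; those still need the usual XML escaping before the text is written out.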


Re: Latin-1 characters and XML (downgrade)
by tye (Sage) on Apr 30, 2003 at 18:36 UTC

    Instead of upgrading Perl, downgrade your XML module(s). I fear someone made a big mistake in allowing these modules to be upgraded on older versions of Perl without raising huge warning bells, since the upgrade causes a major reduction in their functionality.

    Sorry, I wish I had more details. I've seen this problem lots of times but so far I've only been on the edge watching it happen, not actually involved. I hope my posting will encourage someone with more details to respond. (:

                    - tye
Re: Latin-1 characters and XML
by grantm (Parson) on Apr 30, 2003 at 19:05 UTC

    I don't think using SAX is likely to solve your problem directly, but the reason it's not working for you is that you haven't installed a SAX parser. XML::SAX installs a base class for writing your own SAX filters and a framework for selecting a parser module using a ParserFactory class. It also installs a PurePerl XML parser which, in addition to being slow, requires Perl 5.8 to handle encodings. If you install XML::SAX::Expat, it will configure itself into the SAX ParserFactory framework as the default parser, and it does support the iso-8859-1 encoding.
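    One way to see which parsers the ParserFactory has to choose from is to ask XML::SAX for its list of registered parsers. A minimal sketch, assuming XML::SAX may or may not be installed on the machine; `registered_sax_parsers` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Return the names of the SAX parsers registered with XML::SAX,
# or an empty list if XML::SAX itself cannot be loaded.
sub registered_sax_parsers {
    return () unless eval { require XML::SAX; 1 };
    return map { $_->{Name} } @{ XML::SAX->parsers };
}

my @names = registered_sax_parsers();
print "Registered SAX parsers: @names\n";
print "XML::SAX::Expat is ",
    ( ( grep { $_ eq 'XML::SAX::Expat' } @names ) ? '' : 'not ' ),
    "registered\n";
```

    If only XML::SAX::PurePerl shows up, that matches the situation described above, where no Expat-based SAX parser has been installed yet.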
