inguanzo has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I have been working on this problem for a while. I have read much of the valuable documentation already available from PerlMonks experts, but UTF8 support doesn't seem to be working correctly in Perl :(
I'm reading a UTF8-encoded Russian XML file, and at the moment I'm trying to extract the element content and pass it to another XML structure. Perl is not reading UTF8 characters; instead it tries to read each byte as a character:
$SourceHash->{$order_value} = $import_StringTable_xp_node->findvalue("value[\@language!=\"English\"]/text");
print XML_DIFF_FH $SourceHash->{$order_value} . "\n";
$SourceHash->{$order_value} = pack ("U*", unpack("C*", $SourceHash->{$order_value}));
print XML_DIFF_FH $SourceHash->{$order_value} . "\n";
print STDERR Encode::is_utf8($SourceHash->{$order_value})?"1":"0";
Even with the filehandle opened in utf8 mode, this does not work. I can see that the UTF8 flag is set after the pack conversion (this was the only mechanism that allowed me to set the flag; the other approaches, e.g. decoding as utf8, don't seem to work). I have already tried "use bytes" and "use utf8", but nothing works. Any help will be highly appreciated! Cheers PerlMonks! Inguanzo

Replies are listed 'Best First'.
Re: Using a variable with UTF8 content coming from XPATH findvalue
by graff (Chancellor) on Sep 27, 2007 at 21:24 UTC
    Would it be possible for you to do the following:
    • Extract a minimal amount of content from your input data, preserving the initial and final XML tags, one or two additional tags, and some Russian text.

    • Write a very short but complete and runnable perl script, using whatever XML modules you normally would use, in such a way that it reads that small data sample and tries to print it out as some other form of XML stream, but fails.

    • Post that code and data, along with something to show what the output should look like, so we can see more clearly what is going wrong (and we can run your sample script ourselves in case the problem is not immediately obvious).

    Apart from that, I don't know where to start in terms of suggesting what you should try in order to fix your problem, based on the information you have given so far.

    (BTW, why are there all those "</div>" tags in your code snippet? I assume that they are not really part of your script.) Thanks for fixing your code snippet.

      Hi,
      Thanks a lot for the fast reply. Here is an example. It is not the real script, but it reflects the problem of extracting content from XML and trying to paste it into another XML file:
      #!/opt/perl/5.8/bin/perl -I /fw/subsystems/loc/tools/loca/mod/
      # /fw/subsystems/loc/tools/loca/OLDDB_to_LOCA.pl
      use lib "/fw/subsystems/loc/tools/loca/mod/";
      use strict;
      use Encode;
      use XML::XPath;
      use XML::XPath::XMLParser;

      my %SubstituteHash;
      my $SourceHash;
      my $import_xp = XML::XPath->new(filename => "russian_test.xml");
      open (XML_DIFF_FH,"> test_o.xml") or die "$!";
      binmode XML_DIFF_FH, ":utf8";
      print XML_DIFF_FH "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Translation>\n";
      my $import_xp_nodeset = $import_xp->find("/Translation/String");
      foreach my $import_xp_node ( $import_xp_nodeset->get_nodelist ) {
          my $CurrentStringId = $import_xp_node->findvalue("\@name");
          $SourceHash->{$CurrentStringId} = $import_xp_node->findvalue("value[\@language!=\"English\"]/text");
          $SourceHash->{$CurrentStringId} = pack ("U*", unpack("C*", $SourceHash->{$CurrentStringId}));
          print STDERR Encode::is_utf8($SourceHash->{$CurrentStringId})?"1":"0";
          $import_xp->setNodeText("/Translation/String[\@name=\"$CurrentStringId\"]/value[\@language=\"English\"]/text",
                                  $SourceHash->{$CurrentStringId} );
          print XML_DIFF_FH XML::XPath::XMLParser::as_string($import_xp_node) . "\n";
          print XML_DIFF_FH "<test>" . $SourceHash->{$CurrentStringId} . "</test>\n";
      }
      print XML_DIFF_FH "</Translation>";
      close XML_DIFF_FH;
      As for the Russian target file, pasting it here corrupts the encoding, so I'll try to upload it. I'm pasting it anyway, just to show the schema used:
      <?xml version="1.0" encoding="UTF-8" ?>
      <Translation>
        <String name="cprtConsoleYES" translate="yes" ID="137856" context="YES button prompt for PML messages.">
          <sizing type="LynxOS" height="0" width="0" font="CP" fontSize="10.5" bold="0" />
          <value language="English">
            <text>test</text>
          </value>
          <value language="Russian">
            <text>Да</text>
          </value>
        </String>
      </Translation>
      Well, I couldn't figure out how to upload a file here, so I have put the files at the following links:
      Source file: http://www.losinguanzo.com/utf8/russian_test.xml
      The file generated with this script: http://www.losinguanzo.com/utf8/test_o.xml
      The pasted script: http://www.losinguanzo.com/utf8/test.pl
      Thanks a lot for your help.
      Inguanzo
        I edited your sample data file to put in the real Cyrillic characters that you cited, edited the script to get the shebang line right for my machine, and ran it. I saw the problem that you were describing.

        The problem was with the "unpack": I changed the "C*" to "U*", and the data came out fine. Also, if I just comment out that whole "pack(... unpack(...))" line, that also works (at least on my box: macosx with perl 5.8.6).
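To make the difference between the two templates concrete, here is a minimal sketch using only core modules. It assumes the simplest case, a plain byte string holding the UTF-8 encoding of "Да"; the variable names are illustrative and not taken from the original script. Promoting each byte to its own code point doubles the character count and garbles the text once it is encoded again, while unpacking with "U*" on an already decoded string simply round-trips.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes for "Да" (U+0414 U+0430), written as escapes so the
# example does not depend on the encoding of this source file.
my $bytes = "\xD0\x94\xD0\xB0";

# "C*" yields one number per byte; "U*" then turns each number into
# its own character, so two Cyrillic letters become four characters.
my $per_byte = pack("U*", unpack("C*", $bytes));

# Decoding interprets the byte pairs as the two characters intended.
my $chars = decode("UTF-8", $bytes);

# With "U*" on both sides, the code points of a decoded string survive.
my $round_trip = pack("U*", unpack("U*", $chars));

printf "per-byte:   %d characters, %d bytes when encoded\n",
    length($per_byte), length(encode("UTF-8", $per_byte));
printf "decoded:    %d characters, %d bytes when encoded\n",
    length($chars), length(encode("UTF-8", $chars));
printf "round trip: %s\n", $round_trip eq $chars ? "unchanged" : "changed";
```

How unpack("C*", ...) treats a string that is already decoded has varied between Perl releases, so the sketch deliberately starts from raw bytes, where the behaviour is unambiguous.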

        I know, that seems odd, esp. since the "is_utf8" check reports 0 ("not flagged as utf8") when the pack/unpack line is commented out, and yet the output is definitely valid utf8 Cyrillic. (update: I should also confirm that it reports 1 when using pack('U*',unpack('U*',...));)

        (Major mystery of the day: Encode::is_utf8 reports false on a string that comes back from XML::XPath, and yet printing it to STDOUT, without doing binmode STDOUT,":utf8" causes a "Wide character in print" warning. Do the binmode setting on STDOUT and the warning goes away. This implies that perl somehow "knows" that it really is a utf8 string, and Encode::is_utf8 seems to be lying or mistaken. So you are a victim of misinformation from a function that, I should point out, is described under the heading "Messing with Perl's Internals" in the docs for Encode. Ugh.)
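As a small illustration of why the flag is a shaky diagnostic, the sketch below (core Perl only, illustrative variable names) shows that Encode::is_utf8 reports how a scalar is stored internally, not what text it holds: the same string compares equal with the flag off and on. This does not by itself explain the result seen with XML::XPath. One possible factor, not tested here, is that the XML::XPath documentation describes findvalue as returning an object such as XML::XPath::Literal rather than a plain scalar.

```perl
use strict;
use warnings;
use Encode ();

# "café" with é as the single character U+00E9; perl stores this
# compactly, one byte per character, with the UTF8 flag off.
my $plain = "caf\xE9";

# utf8::upgrade changes only the internal storage (to UTF-8, flag on);
# the characters in the string stay exactly the same.
my $upgraded = $plain;
utf8::upgrade($upgraded);

printf "plain flag:    %d\n", Encode::is_utf8($plain)    ? 1 : 0;
printf "upgraded flag: %d\n", Encode::is_utf8($upgraded) ? 1 : 0;
printf "same text:     %s\n", $plain eq $upgraded ? "yes" : "no";
printf "same length:   %s\n", length($plain) == length($upgraded) ? "yes" : "no";
```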

        BTW, the better way to open a file for utf8 output is like this:

        open( OUT, ">:utf8", $filename ) or die "$filename: $!\n";
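For completeness, here is a slightly fuller sketch of the same idea with a lexical filehandle. It swaps in the ":encoding(UTF-8)" layer, the stricter relative of ":utf8", and uses a temporary file from File::Temp as a stand-in for the real output file; both choices are assumptions made for the example, not something the original script does.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Temporary file standing in for test_o.xml (hypothetical for this sketch).
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close $tmp_fh;

# "Да" as a character string, written with escapes.
my $text = "\x{414}\x{430}";

# Attach the encoding layer at open time, so there is no window in
# which the handle exists without it.
open(my $out, ">:encoding(UTF-8)", $filename) or die "$filename: $!\n";
print {$out} "<test>$text</test>\n";
close($out) or die "$filename: $!\n";

# Reading back through the same layer returns characters, not bytes.
open(my $in, "<:encoding(UTF-8)", $filename) or die "$filename: $!\n";
my $line = <$in>;
close($in);
chomp $line;

printf "read back %d characters; round trip %s\n",
    length($line), $line eq "<test>$text</test>" ? "ok" : "failed";
```";
die "unexpected length" unless length($line) == 15;
</test>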
        I'll look at your code in more detail soon (though I worry about the fact that it seems to use some sort of local library that I might not be able to get from CPAN...)

        In the meantime (just for grins, as we say ;), you might try running your sample XML file through this tool that I posted a while back: tlu -- TransLiterate Unicode

        If the file really does contain any utf8 Russian character(s), a command line like this will tell you the exact unicode code point(s) and character name(s):

        tlu -o uf test.xml | grep CYRILLIC
        If there are Russian characters in your file, but they aren't really utf8-encoded (trust me, it happens!), then "tlu" will either report errors or else spit out "FFFD ... REPLACEMENT CHARACTER", and that is most likely the source of all your trouble -- you would need to convert the data from ... (whatever encoding it really is) into true and valid utf8.
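If tlu is not at hand, a rough version of the same validity check can be sketched with the core Encode module. The helper name is_valid_utf8 is hypothetical, and cp1251 is used purely as an example of an encoding the data might really be in; the thread does not say what the actual encoding is.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Return 1 if the byte string decodes cleanly as UTF-8, 0 otherwise.
# FB_CROAK makes decode() die on malformed input; it may also modify
# its argument, so the check works on a copy.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;
    return eval { decode("UTF-8", $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

my $utf8_bytes   = "\xD0\x94\xD0\xB0";   # "Да" encoded as UTF-8
my $cp1251_bytes = "\xC4\xE0";           # "Да" encoded as cp1251 (assumed example)

printf "UTF-8 sample valid:  %d\n", is_valid_utf8($utf8_bytes);
printf "cp1251 sample valid: %d\n", is_valid_utf8($cp1251_bytes);

# Without a check argument, decode() substitutes U+FFFD REPLACEMENT
# CHARACTER for malformed input instead of dying.
my $lossy = decode("UTF-8", $cp1251_bytes);
printf "lossy decode has U+FFFD: %s\n", $lossy =~ /\x{FFFD}/ ? "yes" : "no";

# Once the real encoding is known, decode from it to get the characters.
my $fixed = decode("cp1251", $cp1251_bytes);
printf "decoded from cp1251: %d characters\n", length($fixed);
```

In a real script the bytes would come from a file read in raw mode (binmode or the ":raw" layer), so that no decoding happens before the check.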