in reply to Re: Using a variable with UTF8 content coming from XPATH findvalue
in thread Using a variable with UTF8 content coming from XPATH findvalue

Hi,
Thanks a lot for the fast reply. Here is an example. It is not the real script, but it reproduces the problem of extracting content from one XML file and trying to paste it into another XML file:
#!/opt/perl/5.8/bin/perl -I /fw/subsystems/loc/tools/loca/mod/
# /fw/subsystems/loc/tools/loca/OLDDB_to_LOCA.pl
use lib "/fw/subsystems/loc/tools/loca/mod/";
use strict;
use Encode;
use XML::XPath;
use XML::XPath::XMLParser;

my %SubstituteHash;
my $SourceHash;
my $import_xp = XML::XPath->new(filename => "russian_test.xml");

open (XML_DIFF_FH,"> test_o.xml") or die "$!";
binmode XML_DIFF_FH, ":utf8";
print XML_DIFF_FH "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Translation>\n";

my $import_xp_nodeset = $import_xp->find("/Translation/String");
foreach my $import_xp_node ( $import_xp_nodeset->get_nodelist ) {
    my $CurrentStringId = $import_xp_node->findvalue("\@name");
    $SourceHash->{$CurrentStringId} = $import_xp_node->findvalue("value[\@language!=\"English\"]/text");
    $SourceHash->{$CurrentStringId} = pack ("U*", unpack("C*", $SourceHash->{$CurrentStringId}));
    print STDERR Encode::is_utf8($SourceHash->{$CurrentStringId})?"1":"0";
    $import_xp->setNodeText("/Translation/String[\@name=\"$CurrentStringId\"]/value[\@language=\"English\"]/text",
        $SourceHash->{$CurrentStringId} );
    print XML_DIFF_FH XML::XPath::XMLParser::as_string($import_xp_node) . "\n";
    print XML_DIFF_FH "<test>" . $SourceHash->{$CurrentStringId} . "</test>\n";
}
print XML_DIFF_FH "</Translation>";
close XML_DIFF_FH;
About the target file for Russian: pasting the file here corrupts the encoding, so I'll try to upload it. I'm pasting it anyway, just to show you the schema used:
<?xml version="1.0" encoding="UTF-8" ?>
<Translation>
  <String name="cprtConsoleYES" translate="yes" ID="137856" context="YES button prompt for PML messages.">
    <sizing type="LynxOS" height="0" width="0" font="CP" fontSize="10.5" bold="0" />
    <value language="English">
      <text>test</text>
    </value>
    <value language="Russian">
      <text>Äà</text>
    </value>
  </String>
</Translation>
Well, I couldn't figure out how to upload a file here, so I've put the files at the following links:
Source file: http://www.losinguanzo.com/utf8/russian_test.xml
The file generated with this script: http://www.losinguanzo.com/utf8/test_o.xml
The pasted script: http://www.losinguanzo.com/utf8/test.pl
Thanks a lot for your help.
Inguanzo

Re^3: Using a variable with UTF8 content coming from XPATH findvalue
by graff (Chancellor) on Sep 28, 2007 at 01:43 UTC
    I edited your sample data file to put in the real Cyrillic characters that you cited, edited the script to get the shebang line right for my machine, and ran it. I saw the problem that you were describing.

    The problem was with the "unpack": I changed the "C*" to "U*", and the data came out fine. Also, if I just comment out that whole "pack(... unpack(...))" line, that also works (at least on my box: macosx with perl 5.8.6).

    I know, that seems odd, esp. since the "is_utf8" check reports 0 ("not flagged as utf8") when the pack/unpack line is commented out, and yet the output is definitely valid utf8 Cyrillic. (update: I should also confirm that it reports 1 when using pack('U*',unpack('U*',...));)
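To make the effect of the unpack template concrete, here is a minimal sketch. It rests on an assumption the thread does not confirm: that on the failing setup unpack("C*", ...) was handed the UTF-8 encoded bytes of the string. The sketch encodes explicitly so that it behaves the same on any perl, and the sub names bytes_as_chars and roundtrip_chars are made up for illustration.

```perl
use strict;
use warnings;
use Encode ();

# The Cyrillic letters DE and A, written with escapes so this source stays ASCII.
my $chars = "\x{414}\x{430}";

# Treat every UTF-8 byte as if it were a code point of its own
# (what pack "U*" / unpack "C*" amounts to when unpack sees encoded bytes).
sub bytes_as_chars {
    my ($string) = @_;
    my $bytes = Encode::encode("UTF-8", $string);
    return pack("U*", unpack("C*", $bytes));
}

# Unpack the code points and pack them again: the string comes back unchanged.
sub roundtrip_chars {
    my ($string) = @_;
    return pack("U*", unpack("U*", $string));
}

my $mangled = bytes_as_chars($chars);
printf "original: %d characters\n", length($chars);
printf "mangled:  %d characters\n", length($mangled);
printf "mangled, written as UTF-8: %d bytes\n", length(Encode::encode("UTF-8", $mangled));
print  "U*/U* round trip leaves the string unchanged\n" if roundtrip_chars($chars) eq $chars;
```

The two original characters turn into four, and writing those four through a UTF-8 layer yields eight bytes, which is the classic double-encoded look in the output file.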

    (Major mystery of the day: Encode::is_utf8 reports false on a string that comes back from XML::XPath, and yet printing it to STDOUT, without doing binmode STDOUT,":utf8", causes a "Wide character in print" warning. Do the binmode setting on STDOUT and the warning goes away. This implies that perl somehow "knows" that it really is a utf8 string, and Encode::is_utf8 seems to be lying or mistaken. So you are a victim of misinformation from a function that, I should point out, is described under the heading "Messing with Perl's Internals" in the docs for Encode. Ugh.)
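One possible explanation, offered only as a guess that nothing in this thread confirms: if findvalue hands back an object with overloaded stringification rather than a plain string, then Encode::is_utf8 is inspecting the reference, not the text. The sketch below shows that effect with a hypothetical stand-in class, My::Literal, so it does not depend on XML::XPath at all.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for a value object that stringifies to its text.
package My::Literal;
use overload '""' => sub { ${ $_[0] } }, fallback => 1;
sub new { my ($class, $text) = @_; return bless \$text, $class }

package main;

# Flag check on the object itself versus on its stringified form.
sub flag_of_object { return Encode::is_utf8($_[0])   ? 1 : 0 }
sub flag_of_string { return Encode::is_utf8("$_[0]") ? 1 : 0 }

my $value = My::Literal->new("\x{414}\x{430}");
print "object: ", flag_of_object($value), "\n";
print "string: ", flag_of_string($value), "\n";
```

If that is what is going on, forcing stringification first (for example "$value") before calling Encode::is_utf8 should make the check agree with what print sees.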

    BTW, the better way to open a file for utf8 output is like this:

    open( OUT, ">:utf8", $filename ) or die "$filename: $!\n";
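A slightly fuller sketch of the same idea, using a lexical filehandle instead of the bareword OUT and reading the file back as raw bytes to check what was written. The helper names write_utf8 and read_raw and the temporary file are only for illustration.

```perl
use strict;
use warnings;
use File::Temp ();

# Write a character string through the :utf8 layer, using a lexical handle.
sub write_utf8 {
    my ($filename, $text) = @_;
    open(my $out, ">:utf8", $filename) or die "$filename: $!\n";
    print {$out} $text;
    close($out) or die "$filename: $!\n";
    return;
}

# Read the file back as raw bytes to see what actually landed on disk.
sub read_raw {
    my ($filename) = @_;
    open(my $in, "<:raw", $filename) or die "$filename: $!\n";
    local $/;                     # slurp mode
    my $bytes = <$in>;
    close($in);
    return $bytes;
}

my $tmp = File::Temp->new;        # removed automatically when $tmp goes away
write_utf8($tmp->filename, "\x{414}\x{430}");
print join(" ", map { sprintf "%02X", $_ } unpack("C*", read_raw($tmp->filename))), "\n";
```

The layer ">:encoding(UTF-8)" can be used in the same position when stricter checking of the output is wanted.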
      Hi,
      I forgot to test this on other OSes. You are right, this script works fine without any special handling of UTF-8. I just tried it on:

      WORKS: Windows XP, Perl v5.8.8

      WORKS: a Linux box I have at home: RedHat, kernel version 2.4, Perl 5.8.0

      FAILS: the machine at work: Suse, kernel version 2.6, Perl 5.8.0

      Thanks for the help.
      Inguanzo
        Beware of 5.8.0 in general, and especially on Redhat. It's nice (lucky?) that it works in this case, but I recommend you upgrade that machine soon if it's going to play any sort of important role in your development or usage of unicode-relevant scripts.
Re^3: Using a variable with UTF8 content coming from XPATH findvalue
by graff (Chancellor) on Sep 27, 2007 at 22:33 UTC
    I'll look at your code in more detail soon (though I worry about the fact that it seems to use some sort of local library that I might not be able to get from CPAN...)

    In the meantime (just for grins, as we say ;), you might try running your sample XML file through this tool that I posted a while back: tlu -- TransLiterate Unicode

    If the file really does contain any utf8 Russian character(s), a command line like this will tell you the exact unicode code point(s) and character name(s):

    tlu -o uf test.xml | grep CYRILLIC
    If there are Russian characters in your file, but they aren't really utf8-encoded (trust me, it happens!), then "tlu" will either report errors or else spit out "FFFD ... REPLACEMENT CHARACTER", and that is most likely the source of all your trouble -- you would need to convert the data from ... (whatever encoding it really is) into true and valid utf8.
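For readers without tlu at hand, the same kind of check can be roughed out with core modules. This is not tlu itself, just a sketch: decode_strict and describe_cyrillic are hypothetical helper names, and the output format only loosely resembles what tlu prints.

```perl
use strict;
use warnings;
use Encode ();
use charnames ();

# Decode raw bytes as strict UTF-8; returns undef if they are not valid UTF-8.
sub decode_strict {
    my ($bytes) = @_;
    my $copy = $bytes;            # decode() with a CHECK value may modify its argument
    return eval { Encode::decode("UTF-8", $copy, Encode::FB_CROAK()) };
}

# One "CODEPOINT NAME" line per distinct Cyrillic character, in order of appearance.
sub describe_cyrillic {
    my ($chars) = @_;
    my (%seen, @lines);
    for my $cp (map { ord } split //, $chars) {
        next if $seen{$cp}++;
        my $name = charnames::viacode($cp);
        next unless defined $name && $name =~ /CYRILLIC/;
        push @lines, sprintf("%04X %s", $cp, $name);
    }
    return @lines;
}

my $good = decode_strict("\xD0\x94\xD0\xB0");   # valid UTF-8 bytes
my $bad  = decode_strict("\xC4\xE0");           # bytes that are not valid UTF-8
print "$_\n" for describe_cyrillic($good);
print "second sample is not valid UTF-8\n" unless defined $bad;
```

When the strict decode fails, the data is in some other encoding and would need converting (for example with Encode::decode and the real source encoding) before it can be treated as utf8.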
      Hi,
      I'm not using any local library; all of the XML modules come from CPAN. As for the encoding, it may not be the problem, since the other value, the one that is not replaced, is represented correctly: http://www.losinguanzo.com/utf8/test_o.xml I printed the value twice to be sure the problem occurs before the XML substitution. Here is the output of running your script (thanks, it's really cool):
      bash-3.2$ perl tlu.pl -o uf russian_test.xml | grep CYRILLIC
      0414 Д CYRILLIC CAPITAL LETTER DE
      0430 а CYRILLIC SMALL LETTER A
      bash-3.2$
      Thanks in advance for all your help.
      Mauricio Inguanzo