Using a variable with UTF8 content coming from XPATH findvalue

inguanzo has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Using a variable with UTF8 content coming from XPATH findvalue by graff (Chancellor) on Sep 27, 2007 at 21:24 UTC
Would it be possible for you to do the following:

Extract a minimal amount of content from your input data, preserving the initial and final XML tags, one or two additional tags, and some Russian text. Write a very short but complete and runnable Perl script, using whatever XML modules you normally would use, so that it reads that small data sample and tries to print it out as some other form of XML stream, but fails. Post that code and data, along with something to show what the output should look like, so we can see more clearly what is going wrong (and we can run your sample script ourselves in case the problem is not immediately obvious).

Apart from that, I don't know where to start in terms of suggesting what you should try in order to fix your problem, based on the information you have given so far.

~~(BTW, why are there all those "</div>" tags in your code snippet? I assume that they are not really part of your script.)~~ Thanks for fixing your code snippet.
Re^2: Using a variable with UTF8 content coming from XPATH findvalue by inguanzo (Acolyte) on Sep 27, 2007 at 22:12 UTC
Hi,

Thanks a lot for the fast reply. Here is an example. It is not the real script, but it reflects the problem of extracting content from XML and trying to paste it into another XML file:

    #!/opt/perl/5.8/bin/perl -I /fw/subsystems/loc/tools/loca/mod/
    # /fw/subsystems/loc/tools/loca/OLDDB_to_LOCA.pl
    use lib "/fw/subsystems/loc/tools/loca/mod/";
    use strict;
    use Encode;
    use XML::XPath;
    use XML::XPath::XMLParser;

    my %SubstituteHash;
    my $SourceHash;
    my $import_xp = XML::XPath->new(filename => "russian_test.xml");
    open (XML_DIFF_FH,"> test_o.xml") or die "$!";
    binmode XML_DIFF_FH, ":utf8";
    print XML_DIFF_FH "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Translation>\n";
    my $import_xp_nodeset = $import_xp->find("/Translation/String");
    foreach my $import_xp_node ( $import_xp_nodeset->get_nodelist ) {
        my $CurrentStringId = $import_xp_node->findvalue("\@name");
        $SourceHash->{$CurrentStringId} = $import_xp_node->findvalue("value[\@language!=\"English\"]/text");
        $SourceHash->{$CurrentStringId} = pack ("U", unpack("C", $SourceHash->{$CurrentStringId}));
        print STDERR Encode::is_utf8($SourceHash->{$CurrentStringId})?"1":"0";
        $import_xp->setNodeText("/Translation/String[\@name=\"$CurrentStringId\"]/value[\@language=\"English\"]/text", $SourceHash->{$CurrentStringId} );
        print XML_DIFF_FH XML::XPath::XMLParser::as_string($import_xp_node) . "\n";
        print XML_DIFF_FH "<test>" . $SourceHash->{$CurrentStringId} . "</test>\n";
    }
    print XML_DIFF_FH "</Translation>";
    close XML_DIFF_FH;

About the target file for Russian: pasting the file corrupts the encoding, so I'll try to upload the file.
Pasting, just to let you know the schema used:

    <?xml version="1.0" encoding="UTF-8" ?>
    <Translation>
      <String name="cprtConsoleYES" translate="yes" ID="137856" context="YES button prompt for PML messages.">
        <sizing type="LynxOS" height="0" width="0" font="CP" fontSize="10.5" bold="0" />
        <value language="English">
          <text>test</text>
        </value>
        <value language="Russian">
          <text>Äà</text>
        </value>
      </String>
    </Translation>

Well, I couldn't work out how to upload a file here, so I have put the files at the following links:

Source file: http://www.losinguanzo.com/utf8/russian_test.xml
The file generated with this script: http://www.losinguanzo.com/utf8/test_o.xml
The pasted script: http://www.losinguanzo.com/utf8/test.pl

Thanks a lot for your help.

Inguanzo
Re^3: Using a variable with UTF8 content coming from XPATH findvalue by graff (Chancellor) on Sep 28, 2007 at 01:43 UTC
I edited your sample data file to put in the real Cyrillic characters that you cited, edited the script to get the shebang line right for my machine, and ran it. I saw the problem that you were describing.

The problem was with the "unpack": I changed the "C" to "U", and the data came out fine. Also, if I just comment out that whole "pack(... unpack(...))" line, that works too (at least on my box: Mac OS X with perl 5.8.6).

I know, that seems odd, especially since the "is_utf8" check reports 0 ("not flagged as utf8") when the pack/unpack line is commented out, and yet the output is definitely valid utf8 Cyrillic. (Update: I should also confirm that it reports 1 when using `pack('U',unpack('U',...));`)

(Major mystery of the day: Encode::is_utf8 reports false on a string that comes back from XML::XPath, and yet printing it to STDOUT without doing `binmode STDOUT,":utf8"` causes a "Wide character in print" warning. Do the binmode setting on STDOUT and the warning goes away. This implies that perl somehow "knows" that it really is a utf8 string, and Encode::is_utf8 seems to be lying or mistaken. So you are a victim of misinformation from a function that, I should point out, is described under the heading "Messing with Perl's Internals" in the docs for Encode. Ugh.)

BTW, the better way to open a file for utf8 output is like this: `open( OUT, ">:utf8", $filename ) or die "$filename: $!\n";`
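For anyone who wants to try the layer-based approach in isolation, here is a minimal sketch. It is not the original poster's script: the literal octets for "Да" and the temporary file are stand-ins chosen for illustration, and it uses `Encode::decode` plus an output layer in place of the pack/unpack line. It decodes once on input, lets the file handle encode on output, and then reads the raw bytes back.

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempfile);

# Assumption for illustration: "Да" as raw UTF-8 octets (D0 94 D0 B0),
# i.e. what a UTF-8 encoded input file would hold on disk.
my $octets = "\xD0\x94\xD0\xB0";

# Decode once on input: four octets become two characters.
my $text = decode('UTF-8', $octets);

# Temporary output file (a stand-in for test_o.xml), removed at exit.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close $tmp_fh;

# Let the output layer do the encoding, so no pack/unpack is needed.
open(my $out, '>:encoding(UTF-8)', $filename) or die "$filename: $!\n";
print {$out} "<test>$text</test>\n";
close $out or die "$filename: $!\n";

# Read the file back as raw bytes to see what actually landed on disk.
open(my $in, '<:raw', $filename) or die "$filename: $!\n";
my $written = do { local $/; <$in> };
close $in;

printf "characters: %d, bytes written: %d\n", length($text), length($written);
```

The point of the design is that the string is treated as characters everywhere inside the program, and the only place that deals with bytes is the file handle.\n";
</test>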
Re^4: Using a variable with UTF8 content coming from XPATH findvalue by inguanzo (Acolyte) on Sep 28, 2007 at 06:13 UTC
Re^5: Using a variable with UTF8 content coming from XPATH findvalue by graff (Chancellor) on Sep 28, 2007 at 07:02 UTC
Re^3: Using a variable with UTF8 content coming from XPATH findvalue by graff (Chancellor) on Sep 27, 2007 at 22:33 UTC
I'll look at your code in more detail soon (though I worry about the fact that it seems to use some sort of local library that I might not be able to get from CPAN...)

In the meantime (just for grins, as we say ;), you might try running your sample XML file through this tool that I posted a while back: tlu -- TransLiterate Unicode

If the file really does contain any utf8 Russian character(s), a command line like this will tell you the exact Unicode code point(s) and character name(s): `tlu -o uf test.xml | grep CYRILLIC`

If there are Russian characters in your file, but they aren't really utf8-encoded (trust me, it happens!), then "tlu" will either report errors or else spit out "FFFD ... REPLACEMENT CHARACTER", and that is most likely the source of all your trouble -- you would need to convert the data from ... (whatever encoding it really is) into true and valid utf8.
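If tlu is not at hand, the same kind of check can be roughed out with core modules. The following is only a sketch and not a replacement for tlu: `describe_non_ascii` is a hypothetical helper name, and the two sample strings are made up for illustration. It decodes strictly, so octets that are not valid UTF-8 raise an error, and it lists the code point and Unicode name of every non-ASCII character otherwise.

```perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ();

# Hypothetical helper (not part of tlu): returns one "XXXX NAME" string
# per non-ASCII character, or dies if the octets are not valid UTF-8.
sub describe_non_ascii {
    my ($octets) = @_;
    # decode() with a CHECK argument may modify its input, so use a copy.
    my $copy = $octets;
    my $text = decode('UTF-8', $copy, Encode::FB_CROAK);
    return map  { sprintf('%04X %s', ord($_), charnames::viacode(ord($_)) || '(unnamed)') }
           grep { ord($_) > 0x7F }
           split //, $text;
}

# Valid UTF-8 octets for two Cyrillic letters.
print "$_\n" for describe_non_ascii("<text>\xD0\x94\xD0\xB0</text>");

# Two octets from a single-byte legacy encoding: not valid UTF-8.
eval { describe_non_ascii("\xC4\xE0"); 1 }
    or print "not valid UTF-8: $@";
```

If the second case is what your real file produces, the data needs converting from its actual encoding first, for example with `decode` and the name of that encoding.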
Re^4: Using a variable with UTF8 content coming from XPATH findvalue by inguanzo (Acolyte) on Sep 27, 2007 at 22:40 UTC