jodaka has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, I've got a weird issue with Russian letters in XML::XPath.
The background is:
an XML configuration file with descriptions in different languages. File encoding is UTF8.

Now I want to import some descriptions from other language files (which are all in UTF8 as well).
$xp = XML::XPath->new(filename => 'config.xml');
$xp->setNodeText(qq~/conf/section[\@name="$name"]/description[\@lang="$lang"]~, $val);
Here $val is taken from an external TXT file. Running
$xp->find(qq~/conf/section[\@name="$name"]/description[\@lang="$lang"]~)->get_node(1)->string_value();
gives me 'это тестовая строка', a nice UTF8 string in Russian.
That's fine.

Now, at the end, I want to save my changed XML file, and I'm doing something like this:
open(XML, ">config.xml");
print XML $xp->find('/conf')->get_node(1)->toString;
close XML;
The problem here is in the toString method. It kills my lovely UTF8 Russian text... it transforms 'это тестовая строка' into žсновн‹е нас‚€ойки :(
It looks like the toString output is iso8859-1, not UTF8. There's no info in the XML::XPath manual about charsets, and I'm stuck.

Any advice? How do I tell XML::XPath to give me UTF8 results, not iso8859-1?

Re: UTF8 issue with XML::XPath
by Corion (Patriarch) on Mar 05, 2008 at 10:33 UTC

    I would try to set the output file to UTF-8 mode, at least that's how I did it with XML::LibXML:

    my $outfile = "config.xml";
    open my $xml, ">:utf8", $outfile
        or die "Couldn't create '$outfile': $!";
    # Alternatively:
    # binmode $xml, ':utf8';
    print {$xml} $xp->find('/conf')->get_node(1)->toString;
    close $xml;
      If you need to specify the encoding on an XML file, then the XML parser is broken.

      The XML encoding is specified in the XML file itself, and the XML parser MUST parse the encoding and act accordingly.

      Seems that I was reading too fast ... it's not about parsing an XML file, but about writing an XML fragment. And indeed, this is the right solution, one has to specify the encoding himself in this situation.

Re: UTF8 issue with XML::XPath
by moritz (Cardinal) on Mar 05, 2008 at 10:34 UTC
    here $val is taken from external TXT file.

    Did you open that file with <:encoding(UTF-8)? If not, it's a byte string, and probably won't work well with clean modules.

    Update: and of course you need the >:encoding(UTF-8) output layer when opening the result file.

      No, no, you didn't understand me... the problem is not in the text files.
      I can read/write files in UTF8 without any problems (since I'm on Linux and UTF8 is my default locale) even without specifying binmode. My console is in UTF8. And if I run
      print $xp->find('/conf')->get_node(1)->toString();
      it gives me data in the wrong encoding. So it's not related to file I/O... it's in XML::XPath.
        Even if your default locale is utf8, you need to specify input and output conversions. The locale doesn't affect the I/O layers by default (that can be enabled with use open ':locale';, though).

        XML::XPath will most likely return text strings, while print can only work properly with byte strings.

        So you might even want to try binmode STDOUT, ':encoding(UTF-8)'; before printing.
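
A minimal sketch of the text-string versus byte-string distinction described above, using the core Encode module. It is not taken from the thread: the variable names and the sample word "тест" are assumptions made for illustration, and the bytes are written out literally rather than read from a file.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they would arrive from a file read without an input layer:
# the UTF-8 encoding of the Russian word "тест" (4 characters, 8 bytes).
my $bytes = "\xD1\x82\xD0\xB5\xD1\x81\xD1\x82";

# Decode once on input to get a character (text) string.
my $text = decode('UTF-8', $bytes);

# Encode once on output to get bytes back.
my $out = encode('UTF-8', $text);

# Encoding a string that was never decoded treats each byte as a
# Latin-1 character and encodes it again, which is a typical way
# to end up with mojibake.
my $double = encode('UTF-8', $bytes);

printf "bytes in: %d, characters: %d, bytes out: %d, double-encoded: %d\n",
    length($bytes), length($text), length($out), length($double);
```

The I/O layers mentioned in this thread (:encoding(UTF-8) on open or binmode) perform the same decode and encode steps implicitly at the filehandle boundary.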

Re: UTF8 issue with XML::XPath
by mirod (Canon) on Mar 05, 2008 at 16:22 UTC

    Are you sure it's the toString method and not the print? Try it under the debugger and see if x $xp->find('/conf')->get_node(1)->toString looks OK.

    For backward compatibility reasons perl automagically encodes strings in ISO-8859-1 when outputting to a filehandle with no known encoding. This happens even if your locale is utf-8. So you need to either do a binmode STDOUT, ':utf8' before the print, or set the environment variable PERL_UNICODE to S, in bash: PERL_UNICODE=S ./my_script.
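
One way to see the effect described above without touching the terminal is to print the same string to two in-memory filehandles, one with no layer and one with :utf8. This is only a sketch: the sample string and variable names are assumptions, and it uses a character in the Latin-1 range (U+00E9) so that the no-layer case prints without a "Wide character" warning.

```perl
use strict;
use warnings;

# A character string with one non-ASCII character that fits in Latin-1
# (U+00E9, e with acute accent).
my $text = "caf\x{e9}";

# Print to an in-memory filehandle with no encoding layer.
my $raw = '';
open my $plain, '>', \$raw or die "Cannot open in-memory handle: $!";
print {$plain} $text;
close $plain;

# Print the same string through a :utf8 layer.
my $encoded = '';
open my $utf8, '>:utf8', \$encoded or die "Cannot open in-memory handle: $!";
print {$utf8} $text;
close $utf8;

# Show the byte count and hex dump of what each handle received.
printf "no layer:    %d bytes (%s)\n", length($raw),     unpack('H*', $raw);
printf ":utf8 layer: %d bytes (%s)\n", length($encoded), unpack('H*', $encoded);
```

The handle without a layer receives a single byte for the accented character, while the :utf8 handle receives the two-byte UTF-8 sequence.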

Re: UTF8 issue with XML::XPath
by martinovski (Initiate) on Nov 05, 2010 at 23:37 UTC

    Hi,

    2.5 years later and I am having exactly the same issue. Have you found a solution yet?
      If your code has the same bug as the OP, it stands to reason the same solution applies. Posted 2.5 years ago too.
      Seeing how jodaka hasn't been around in about a year, I doubt he will answer :) Wander over to Seekers of Perl Wisdom and start a new thread

        Thanks for the answer, I solved it with:

        binmode(STDOUT, ":utf8");