pike has asked for the wisdom of the Perl Monks concerning the following question:

Dear brethren,

Recently, when I tried to parse an XML file, I was puzzled by the behavior of string concatenation: I had two strings, "50" and "fünfzig" (containing a German umlaut), but when I concatenated them, the result was "50fünfzig" with the umlaut garbled.

I eventually figured out that it had to do with the encoding. The document was in Latin1, and since I had used XML::Twig with the KeepEncoding flag, the text contained in the elements was still Latin1 after parsing. But the contents of the attributes were UTF-8. The problem was obviously that I had tried to concatenate the contents of an attribute with text. The code below shows what happened:

use strict;
use XML::Twig;
use Text::Iconv;

my $utf2latin = Text::Iconv->new ('UTF-8', "ISO-8859-1");
my $latin2utf = Text::Iconv->new ("ISO-8859-1", 'UTF-8');

sub Node {
    my ($twig, $node) = @_;
    my $txt = $node->text ();
    my $id = $node->att ('id');

    #print two strings individually
    print "$id\n";
    print "$txt\n";

    #print list - OK
    print "1) ", $id, " ", $txt, "\n";

    #print with string concat - garbled
    print "2) $id $txt\n";

    #convert $id to Latin1 - now the string concat works
    my $latId = $utf2latin->convert ($id);
    print "3) $latId $txt\n";

    #convert $txt to UTF-8 - still doesn't work
    my $utfTxt = $latin2utf->convert ($txt);
    print "4) $id $utfTxt\n";

    #first string concat, then convert to Latin1 - OK
    print $utf2latin->convert ("5) $id $txt\n");

    #the concatenated string does not match the Latin1 part
    my $res = "$id $txt" !~ /$txt/;
    print "\"$id $txt\" !~ /$txt/ => $res\n";
}

package main;

my $twig = XML::Twig->new (KeepEncoding => 1, TwigHandlers => {Node => \&Node});
$twig->parse (\*DATA);

__DATA__
<?xml version="1.0" encoding="ISO-8859-1"?>
<Document>
  <Node id="50">fünfzig</Node>
</Document>

This prints:

50
fünfzig
1) 50 fünfzig
2) 50 fünfzig
3) 50 fünfzig
4) 50 fünfzig
5) 50 fünfzig
"50 fünfzig" !~ /fünfzig/ => 1

Now I have two questions:

1) Why can I convert the contents of the attribute to Latin1 (example 3), but not the text to UTF-8 (example 4)?

2) Is there any way to find the encoding of a Perl string? I would have saved a lot of guessing if I had been able to see that the encoding of the two strings was different.

Thanks,

pike

Replies are listed 'Best First'.
Re: Problems with string concatenation and encodings
by bronto (Priest) on Dec 12, 2002 at 10:53 UTC

    Just 2 cents, really: as far as I know, an XML parser is required to return UTF-8. You can write a document in any encoding your system supports, but when you feed it to a parser, it must return UTF-8.

    AFAIK

    Ciao!
    --bronto

    # Another Perl edition of a song:
    # The End, by The Beatles
    END {
      $you->take($love) eq $you->make($love) ;
    }

Re: Problems with string concatenation and encodings
by mirod (Canon) on Dec 12, 2002 at 11:23 UTC

    What is your environment? On my machine it works just fine (see below). I use this to get the relevant information:

    alias ts='perl -v | grep "This is"; \
              perl -e"use XML::Twig; \
                      print qq{XML::Twig: \$XML::Twig::VERSION\n}"; \
              perl -e"use XML::Parser; \
                      print qq{XML::Parser: \$XML::Parser::VERSION\n}"; \
              xmlwf -v;'

    Note that if the last line (xmlwf ...) returns an error, you are probably using expat 1.95.2 (xmlwf did not support the -v option at the time).

    This is my test environment:

    This is perl, v5.6.1 built for i686-linux
    XML::Twig: 3.09
    XML::Parser: 2.31
    xmlwf using expat_1.95.5

    In any case this works fine for me:

    #!/usr/bin/perl -w
    use strict;
    use XML::Twig;

    sub Node {
        my ($twig, $node) = @_;
        my $txt = $node->text ();
        my $id = $node->att ('id');

        #print two strings individually
        print "$id\n";
        print "$txt\n";

        #print list - OK
        print "1) list: ", $id, " ", $txt, "\n";

        #print with string concat - still OK
        print "2) concat: $id $txt\n";

        #the concatenated string DOES match the Latin1 part
        my $res = "$id $txt" =~ /$txt/ ? "match" : "NO match";
        print "\"$id $txt\" =~ /$txt/ => $res\n";
    }

    package main;

    my $twig = XML::Twig->new (KeepEncoding => 1, TwigHandlers => {Node => \&Node});
    $twig->parse (\*DATA);

    __DATA__
    <?xml version="1.0" encoding="ISO-8859-1"?>
    <Document>
      <Node id="50">fünfzig</Node>
    </Document>

    A last remark: nothing in the XML spec specifies what the XML processor (they don't even use the word parser) is supposed to report. It just happens that most parsers convert to UTF-8 (or probably UTF-16 in the Java world) because it makes sense. At least expat and libxml do; see "The internal encoding, how and why" for a good explanation. The keep_encoding option of XML::Twig is quite atypical, and it is there for the common case of all documents being in the same (1-byte) encoding that is known in advance.
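The "one internal encoding" idea can be sketched in plain Perl: decode every incoming byte string once, work with character strings, and encode once on output. This is only an illustration, assuming Perl 5.8 or later with the core Encode module (not the 5.6.1 setup discussed in this thread); `to_chars` and `to_bytes` are hypothetical helper names, and the two input encodings are assumptions taken from the original post.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn raw bytes in a known encoding into characters.
sub to_chars {
    my ($bytes, $encoding) = @_;
    return decode($encoding, $bytes);
}

# Hypothetical helper: turn characters back into bytes for output.
sub to_bytes {
    my ($chars, $encoding) = @_;
    return encode($encoding, $chars);
}

my $id  = to_chars("50", 'UTF-8');               # attribute, assumed to be UTF-8
my $txt = to_chars("f\xFCnfzig", 'ISO-8859-1');  # text, assumed to be Latin-1

# Both values are character strings now, so joining and matching are safe.
my $joined = "$id $txt";
print "match\n" if $joined =~ /\Q$txt\E/;

# Encode exactly once, at the output boundary.
binmode STDOUT;
print to_bytes("$joined\n", 'ISO-8859-1');
```

With this arrangement the encoding of each input only has to be known at the point where it enters the program.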

      The word "parser" actually appears in the first version of the recommendation, but vanishes in the last version of the document. Incidentally, you linked the first, where the word parser appears.

      Unfortunately, I wasn't able to find the source of my first post's information, but since it seems to be inaccurate, that could be more a good than a bad thing :-)

      Ciao!
      --bronto

      Hi Mirod,

      thanks for answering my post. To start, here is what I get when I run your 'ts' command:

      This is perl, v5.6.1 built for i686-linux
      XML::Twig: 2.02
      XML::Parser: 2.30
      bash: xmlwf: command not found

      So maybe this is a problem which only occurs with older versions of Twig? Is there anything I have to change in my code if I upgrade to the newest version?

      I didn't mention Twig in the title of the post because I thought this was a problem with concatenating strings that have different encodings, not something specific to Twig. I thought the source of the problem was probably that the attribute values were in UTF-8 while the text was in the original encoding (Latin1), and the checks I did seemed to confirm this. I had had similar experiences with XML::LibXML, so I thought this was probably normal behavior.

      Still, my question remains: is there any way to find the encoding of a (perl) string? This would help me a lot to avoid similar problems in future.

      Thanks,

      pike

        Whaouh! XML::Twig 2.02 is pretty old! You should definitely upgrade, provided you don't (ab)use tricks like including mark-up in the text of elements (see the Changes file): provided you use the keep_encoding option, the new version will get the attributes in the original encoding, not in UTF-8.
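A possible reading of cases 3 and 4 in the original post, offered as a sketch rather than a diagnosis: when a character string is joined with a byte string, Perl treats each byte as a Latin-1 code point. Latin-1 bytes therefore survive, while UTF-8 bytes end up encoded a second time. The example assumes Perl 5.8 or later with the core Encode module, whose details differ from 5.6.1; `join_and_encode` is a hypothetical helper, and `utf8::upgrade` merely stands in for a parser that returns UTF-8 flagged strings.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: join a character string with raw bytes, then
# encode the result as UTF-8 for output.
sub join_and_encode {
    my ($chars, $bytes) = @_;
    utf8::upgrade($chars);   # assumption: mimic a string coming from the parser
    return encode('UTF-8', "$chars $bytes");
}

# Show a byte string as space-separated hex values.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my $latin1_text = "f\xFCnfzig";       # "fünfzig" as ISO-8859-1 bytes
my $utf8_text   = "f\xC3\xBCnfzig";   # "fünfzig" as UTF-8 bytes

# Latin-1 bytes map directly to code points: the umlaut becomes C3 BC.
print hex_dump(join_and_encode("50", $latin1_text)), "\n";

# UTF-8 bytes are treated the same way, so C3 and BC are each encoded again.
print hex_dump(join_and_encode("50", $utf8_text)), "\n";
```

If this is what happened, converting the text to UTF-8 before the concatenation (case 4) would give the doubly encoded form, whereas converting the attribute to Latin1 (case 3) leaves two plain byte strings that concatenate without any conversion.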

        As far as guessing the encoding of a string, you can try Encode::Guess, but it might not work with 5.6.1 (it is part of 5.8.0 core) and, as stated by the author, "Use this module with care."
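A minimal sketch of how Encode::Guess might be used, assuming Perl 5.8 or later; `describe_encoding` is a hypothetical helper name. `guess_encoding` returns an encoding object on success and a plain error string otherwise, so the result has to be checked with `ref`. Single-byte encodings such as ISO-8859-1 accept any byte sequence, which is why a guess that includes them as suspects can come back ambiguous.

```perl
use strict;
use warnings;
use Encode::Guess;   # exports guess_encoding

# Hypothetical helper: return the name of the guessed encoding, or the
# message from Encode::Guess when the guess fails or is ambiguous.
sub describe_encoding {
    my ($bytes, @suspects) = @_;
    my $guess = guess_encoding($bytes, @suspects);
    return ref($guess) ? $guess->name : "no single guess: $guess";
}

# UTF-8 bytes, default suspects only.
print describe_encoding("f\xC3\xBCnfzig"), "\n";

# Latin-1 bytes, with ISO-8859-1 added as a suspect.
print describe_encoding("f\xFCnfzig", 'iso-8859-1'), "\n";

# UTF-8 bytes are also valid Latin-1, so this guess is ambiguous.
print describe_encoding("f\xC3\xBCnfzig", 'iso-8859-1'), "\n";

# utf8::is_utf8 reports only Perl's internal storage flag, not an encoding.
print utf8::is_utf8("f\xFCnfzig") ? "flag on\n" : "flag off\n";
```

The flag check tells you whether Perl regards a value as a character string, but it cannot say which encoding a byte string is in; that still has to be known or guessed.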