
Dear brethren,

Recently, when I tried to parse an XML file, I was puzzled by the behavior of string concatenation: I had two strings, "50" and "fünfzig" (containing a German umlaut), but when I concatenated them, the result was "50fÃ¼nfzig" - the umlaut was garbled.

I eventually figured out that it had to do with the encoding. The document was in Latin1, and since I had used XML::Twig with the KeepEncoding flag, the text contained in the elements was still Latin1 after parsing, but the contents of the attributes were UTF8. The problem was evidently that I had tried to concatenate the contents of an attribute with element text. The code below shows what happened:

use strict;

use XML::Twig;
use Text::Iconv;

my $utf2latin = Text::Iconv->new ('UTF-8', "ISO-8859-1");
my $latin2utf = Text::Iconv->new ("ISO-8859-1", 'UTF-8');

sub Node {
    my ($twig, $node) = @_;
    my $txt = $node->text ();
    my $id = $node->att ('id');
    
    #print two strings individually
    print "$id\n";
    print "$txt\n";

    #print list - OK
    print "1) ", $id, " ", $txt, "\n";

    #print with string concat - garbled
    print "2) $id $txt\n";

    #convert $id to Latin1 - now the string concat works
    my $latId = $utf2latin->convert ($id);
    print "3) $latId $txt\n";

    #convert $txt to UTF-8 - still doesn't work
    my $utfTxt = $latin2utf->convert ($txt);
    print "4) $id $utfTxt\n";

    #first string concat, then convert to Latin1 - OK
    print $utf2latin->convert ("5) $id $txt\n");

    #the concatenated string does not match the Latin1 part
    my $res = "$id $txt" !~ /$txt/;
    print "\"$id $txt\" !~ /$txt/ => $res\n";
}

package main;
my $twig = XML::Twig->new (KeepEncoding => 1, TwigHandlers => {Node => \&Node});
$twig->parse (\*DATA);
__DATA__
<?xml version="1.0" encoding="ISO-8859-1"?>
<Document>
    <Node id="50">fünfzig</Node>
</Document>

This prints:

50
fünfzig
1) 50 fünfzig
2) 50 fÃ¼nfzig
3) 50 fünfzig
4) 50 fÃÂ¼nfzig
5) 50 fünfzig
"50 fÃ¼nfzig" !~ /fÃ¼nfzig/ => 1
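For comparison, here is a minimal sketch of the usual way to avoid mixing the two kinds of strings: decode the Latin1 text into Perl characters first, concatenate, and encode only once at output time. It uses the core Encode module rather than Text::Iconv. The literal values are hypothetical stand-ins for what the handler receives, and it is my assumption that the element text arrives as raw Latin1 bytes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-ins for the values the handler receives.
my $id  = "50";            # attribute value (plain ASCII here)
my $txt = "f\xFCnfzig";    # assumption: element text as raw Latin1 bytes

# Decode the bytes into Perl characters first, then concatenate.
my $chars  = decode('ISO-8859-1', $txt);
my $joined = "$id $chars";

# Encode exactly once, at output time, in whichever encoding is wanted.
my $as_latin1 = encode('ISO-8859-1', $joined);
my $as_utf8   = encode('UTF-8',      $joined);

binmode STDOUT;            # write the encoded bytes unchanged
print "$as_latin1\n";
print "$as_utf8\n";
```

Which of the two printed lines looks right depends on the encoding the terminal expects.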

Now I have two questions:

1) Why can I convert the contents of the attribute to Latin1 (example 3), but not the text to UTF8 (example 4)?

2) Is there any way to find the encoding of a Perl string? I would have saved a lot of guessing if I had been able to see that the encoding of the two strings was different.
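The kind of check I have in mind might look like the sketch below. It assumes that what distinguishes the two strings is Perl's internal UTF8 flag, which utf8::is_utf8 reports. The values are hypothetical stand-ins for my attribute and text, and utf8::upgrade is used only to imitate a string that already carries the flag. The flag says how Perl stores the string internally, not which encoding the bytes were meant to be in, so this is at best a partial answer.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the two values from the handler.
my $id = "50";
utf8::upgrade($id);         # assumption: the attribute value carries the UTF8 flag
my $txt = "f\xFCnfzig";     # assumption: the element text is raw bytes, flag off

# Concatenating a flagged and an unflagged string upgrades the result.
my $joined = "$id $txt";

for my $pair ([id => $id], [txt => $txt], [joined => $joined]) {
    my ($name, $value) = @$pair;
    printf "%-6s UTF8 flag: %s\n", $name, utf8::is_utf8($value) ? "on" : "off";
}
```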

Thanks,

pike

In reply to Problems with string concatenation and encodings by pike
