
Somehow encode and decode don't work!


My code (snippet) looks like this:
use XML::SMART;
use Encode;

my $XML = XML::Smart->new(q`<?xml version="1.0" encoding="UTF-8" ?>
<MSR-ISSUE xmlns="" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="my.xsd">
</MSR-ISSUE>`, 'XML::Smart::Parser');

my $test_in_incoming_xml = "Auch wenn man es nach Jahren guter Beschäftigung kaum verstehen kann";
my $decoded_string = decode("utf8", $test_in_incoming_xml);
$XML->{'MSR-ISSUE'}{'SHORT-NAME'}->content(0, $decoded_string);

my $xmlfile = "C:\\Temp\\TestFile.xml";
$XML->save($xmlfile, nometagen => 1, forceutf8 => 1);
The output under the SHORT-NAME node in my XML looks like this: "Auch wenn man es nach Jahren guter Besch&#65533;ftigung kaum verstehen kan"
The XML seems to be valid with utf8 encoding.
Let's say I want to read this XML again and insert it into a database field. When I use encode, it doesn't seem to work:
#...xml handling...
my $XMLRead = XML::Smart->new("C:\\Temp\\TestFile.xml", 'XML::Smart::Parser');
my $sendername = $XMLRead->{'MSR-ISSUE'}{'SHORT-NAME'};
my $encoded_string = encode("cp1250", $sendername);
print $encoded_string;
This prints nothing. When I append a static text to the variable $sendername, the script prints the output shown below:
#...xml initi...
my $sendername = $XMLRead->{'MSR-ISSUE'}{'SHORT-NAME'};
$sendername = $sendername . " Test";
my $encoded_string = encode("cp1250", $sendername);
print $encoded_string;
The output is printed like this
Auch wenn man es nach Jahren guter Besch?ftigung kaum verstehen kann Test
Basically, I would expect the umlaut back after encoding. Am I doing anything wrong here?
Kindly help me.

Re^3: UTF in Perl
by moritz (Cardinal) on Oct 10, 2008 at 14:41 UTC
    my code(snippet) looks like this
    use XML::SMART;

    Where did you get XML::SMART from? I can't find that on CPAN, only XML::Smart.

    The XML seems to be valid with utf8 encoding.

    Funny, when I run your script it prints the ä in Latin-1, which is a bug in the module (IMHO).

    This works for me, i.e. it produces a valid UTF-8 XML file (source file stored in UTF-8):

    use XML::Smart;
    use Encode;

    my $XML = XML::Smart->new(q`<?xml version="1.0" encoding="UTF-8" ?>
    <MSR-ISSUE xmlns="" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="my.xsd">
    </MSR-ISSUE>`, 'XML::Smart::Parser');

    my $test_in_incoming_xml = "Auch wenn man es nach Jahren guter Beschäftigung kaum verstehen kann";
    utf8::upgrade($test_in_incoming_xml);
    $XML->{'MSR-ISSUE'}{'SHORT-NAME'}->content(0, $test_in_incoming_xml);

    my $xmlfile = "foo.xml";
    $XML->save($xmlfile, nometagen => 1, forceutf8 => 1);

    I'm pretty sure that the utf8::upgrade line is fundamentally wrong, and only compensates for an XML::Smart bug.
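One possible reading of the &#65533; and "?" symptoms reported above, sketched with Encode alone. The assumption here (not confirmed anywhere in the thread) is that the original script was saved as Latin-1 or cp1252, so that the "ä" in the string literal is the single byte 0xE4. The variable names are made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the script was saved as Latin-1, so "ä" is the single byte 0xE4.
my $latin1_bytes = "Besch\xE4ftigung";

# 0xE4 followed by "f" is not a valid UTF-8 sequence, so decode()
# (with its default CHECK of 0) substitutes U+FFFD for the bad byte.
my $broken_text = decode("utf8", $latin1_bytes);
printf "U+%04X\n", ord(substr($broken_text, 5, 1));   # prints U+FFFD

# U+FFFD has no mapping in cp1250, so encode() substitutes "?".
my $broken_bytes = encode("cp1250", $broken_text);
print "$broken_bytes\n";                              # prints Besch?ftigung
```

U+FFFD is decimal 65533, which would match the &#65533; entity in the saved XML, and the "?" matches the printed output. If this is what happened, the umlaut is already lost at the decode step and no later encode can bring it back.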

      Thanks Moritz!
      XML::Smart and XML::SMART both work! No idea how! This is the same module you referred to.
      I am completely lost here :-(
      Basically, this is my situation:

      1. I get an XML file which is in UTF-8 format. It will have all sorts of special characters, but UTF-8 encoded.
      2. I have to take these values and convert them back to windows-1256 or 1252 (MS Windows Latin 1) so that the users see the text properly.
      3. I have to export the same data back as XML from the database in UTF-8 format.
      Currently I use UTF8Simple, which is buggy :-(
      Somehow "use encoding" doesn't seem to work either!
        XML::Smart and XML::SMART both work!

        ... but only as long as you are on a case-insensitive file system. As soon as that changes -> BOOM. So please use the correct spelling.

        I am completely lost here :-(

        It's not an easy topic, mostly because many modules are buggy. But I can't do more than provide you a working example.

        I'll try to give you some general advice though, most of which is already in the article I linked to above.

        • Forget about UTF8Simple. Now.
        • Use a not-so-buggy XML module. XML::LibXML and XML::Twig both have been recommended here multiple times, and I've used both (on very small projects) with success.
        • Your non-buggy XML module will decode all strings on reading, and encode them on writing. So as long as you only deal with decoded text strings, you're done. So make sure that everything that comes from the outside into your program is also decoded. Maybe IO layers (also described in this article) might help you with this.
        • Use encode (or IO layers) to present your data to the user
        • Use Devel::Peek to debug your code.
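The decode-on-input, encode-on-output advice in the list above can be sketched with Encode alone. This is only a minimal illustration, not a complete solution: the input bytes are hard-coded in place of real XML, the variable names are hypothetical, and cp1252 is assumed as the target encoding for the users.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the UTF-8 bytes of "Beschäftigung" as they might
# arrive from outside the program ("ä" is the two bytes 0xC3 0xA4).
my $utf8_bytes = "Besch\xC3\xA4ftigung";

# 1. Decode at the input boundary: bytes -> text string.
my $incoming_text = decode("UTF-8", $utf8_bytes);

# 2. Encode at the output boundary for the Windows users or the database.
my $cp1252_bytes = encode("cp1252", $incoming_text);

# 3. Encode as UTF-8 again when exporting the text.
my $exported_bytes = encode("UTF-8", $incoming_text);

printf "%d characters, %d cp1252 bytes, %d UTF-8 bytes\n",
    length($incoming_text), length($cp1252_bytes), length($exported_bytes);
# prints "13 characters, 13 cp1252 bytes, 14 UTF-8 bytes"
```

Inside the program the string is only ever handled as decoded text; each byte encoding appears only at a boundary. When writing to a file, an IO layer such as `open my $fh, '>:encoding(cp1252)', $path` should do step 2 implicitly.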

        From the example text I conclude that you speak German. If that's the case I can recommend the #perlde channel on irc.perl.org, that's easier if you have more questions and don't exactly know how to ask them.