in reply to UTF in Perl

Sometimes it throws an error which I am unable to trace

And what's the text of the error message?

I agree with oshalla that Encode is the way to go. You simply decode($encoding, $string) every string that comes from the outside, and encode($encoding, $string) everything before you write it out. (Or, if your XML module is smart enough, you just pass the decoded string to the module.)
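
As a minimal sketch of that decode-on-input, encode-on-output pattern (the sample bytes and variable names are only illustrative assumptions, not taken from the original script):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the UTF-8 octets for "Beschäftigung" as they
# might arrive from a file, socket or XML document.
my $octets = "Besch\xC3\xA4ftigung";

# Decode once at the input boundary: octets -> Perl character string.
my $text = decode('UTF-8', $octets);

# Inside the program, work with characters, not bytes.
printf "%d octets, %d characters\n", length($octets), length($text);

# Encode once at the output boundary, in whatever encoding the target needs.
my $as_cp1252 = encode('cp1252', $text);   # ä becomes the single byte 0xE4
my $as_utf8   = encode('UTF-8',  $text);   # ä becomes the bytes 0xC3 0xA4
```

The point is that the character string in the middle is encoding-neutral; only the two boundaries need to know about UTF-8 or cp1252.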

See also Character encodings and perl, perluniintro, perlunifaq, perlunicode, Encode.

Re^2: UTF in Perl
by KarthikK (Sexton) on Oct 10, 2008 at 14:17 UTC
    Somehow encode and decode don't work!


    my code(snippet) looks like this
    use XML::SMART;
    use Encode;

    my $XML = XML::Smart->new(q`<?xml version="1.0" encoding="UTF-8" ?>
    <MSR-ISSUE xmlns="" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="my.xsd">
    </MSR-ISSUE>`, 'XML::Smart::Parser');

    my $test_in_incoming_xml = "Auch wenn man es nach Jahren guter Beschäftigung kaum verstehen kann";
    my $decoded_string = decode("utf8", $test_in_incoming_xml);
    $XML->{'MSR-ISSUE'}{'SHORT-NAME'}->content(0,$decoded_string);

    my $xmlfile = "C:\\Temp\\TestFile.xml";
    $XML->save($xmlfile, nometagen => 1, forceutf8 => 1);
    In my XML, the output under the SHORT-NAME node looks like this: "Auch wenn man es nach Jahren guter Besch&#65533;ftigung kaum verstehen kan"
    The XML seems to be valid with utf8 encoding.
    Let's say I want to read this XML again and insert the value into a database field. When I use encode, it doesn't seem to work:
    #...xml handling...
    my $XMLRead = XML::Smart->new("C:\\Temp\\TestFile.xml", 'XML::Smart::Parser');
    my $sendername = $XMLRead->{'MSR-ISSUE'}{'SHORT-NAME'};
    my $encoded_string = encode("cp1250", $sendername);
    print $encoded_string;
    This prints nothing. When I append a static text to the variable $sendername, as below, the script does print something:
    #...xml initi...
    my $sendername = $XMLRead->{'MSR-ISSUE'}{'SHORT-NAME'};
    $sendername = $sendername . " Test";
    my $encoded_string = encode("cp1250", $sendername);
    print $encoded_string;
    The output is printed like this
    Auch wenn man es nach Jahren guter Besch?ftigung kaum verstehen kann Test
    Basically I would expect the umlaut to come back after encoding. Am I doing anything wrong here?
    Kindly help me.
      my code(snippet) looks like this
      use XML::SMART;

      Where did you get XML::SMART from? I can't find that on cpan, only XML::Smart.

      The XML seems to be valid with utf8 encoding.

      Funny, when I run your script it prints the ä in Latin-1, which is a bug in the module (IMHO).

      This works for me (i.e. it produces a valid UTF-8 XML file), with the source file stored in UTF-8:

      use XML::Smart;
      use Encode;

      my $XML = XML::Smart->new(q`<?xml version="1.0" encoding="UTF-8" ?>
      <MSR-ISSUE xmlns="" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="my.xsd">
      </MSR-ISSUE>`, 'XML::Smart::Parser');

      my $test_in_incoming_xml = "Auch wenn man es nach Jahren guter Beschäftigung kaum verstehen kann";
      utf8::upgrade($test_in_incoming_xml);
      $XML->{'MSR-ISSUE'}{'SHORT-NAME'}->content(0,$test_in_incoming_xml);

      my $xmlfile = "foo.xml";
      $XML->save($xmlfile, nometagen => 1, forceutf8 => 1);

      I'm pretty sure that the utf8::upgrade line is fundamentally wrong, and only compensates for an XML::Smart bug.
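
      One plausible explanation for the &#65533; (U+FFFD) in the first script's output, assuming that script file was saved in a single-byte encoding such as Latin-1 or cp1252 rather than UTF-8: the ä then reaches Perl as the lone byte 0xE4, which is not valid UTF-8, so decode("utf8", ...) substitutes the replacement character. A small sketch with only the Encode module (the byte string is a stand-in for the real literal):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: the literal arrived as Latin-1/cp1252 bytes, so ä is 0xE4.
my $latin1_bytes = "Besch\xE4ftigung";

# 0xE4 followed by "f" is malformed UTF-8; with the default CHECK value
# decode() replaces the bad byte with U+FFFD instead of dying.
my $wrong = decode('UTF-8', $latin1_bytes);

# Decoding with the encoding the bytes are really in gives U+00E4 (ä).
my $right = decode('cp1252', $latin1_bytes);

printf "decoded as UTF-8:  U+%04X\n", ord(substr($wrong, 5, 1));
printf "decoded as cp1252: U+%04X\n", ord(substr($right, 5, 1));
```

      So the encoding name passed to decode has to match what the bytes actually are, not what you want them to become.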

        Thanks Moritz!
        XML::Smart and XML::SMART both work! No idea how! It is the same module you referred to.
        I am completely lost here :-(
        Basically this is what I have to do:

        1. I get an XML file in UTF-8 format. It will contain all sorts of special characters, all UTF-8 encoded.
        2. I have to take these values and convert them to windows-1256 or 1252 (MS Windows Latin 1) so that the users see the text properly.
        3. I have to export the same data back as XML from the database in UTF-8 format.
        Currently I use UTF8Simple, which is buggy :-(
        Somehow "use encoding" doesn't seem to work either!
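
        The three steps above can be sketched with Encode alone. This is only an illustration under assumptions: the sample bytes are made up, cp1252 stands in for whichever Windows code page the database really uses, and the XML and database layers are left out.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Step 1: a value as read from the UTF-8 XML (raw octets, hypothetical sample).
my $from_xml = "Besch\xC3\xA4ftigung";
my $text     = decode('UTF-8', $from_xml);

# Step 2: bytes for a cp1252 database column. FB_CROAK makes encode() die
# on characters cp1252 cannot represent instead of silently writing "?";
# LEAVE_SRC keeps $text from being modified in place.
my $for_db = encode('cp1252', $text,
                    Encode::FB_CROAK() | Encode::LEAVE_SRC());

# Step 3: read the bytes back from the database, decode them with the same
# code page, and encode as UTF-8 for the XML export.
my $from_db = decode('cp1252', $for_db);
my $export  = encode('UTF-8', $from_db);
```

        Characters that do not exist in the chosen code page cannot survive step 2, which is why failing loudly there is usually preferable to a quiet "?".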
Re^2: UTF in Perl
by KarthikK (Sexton) on Oct 10, 2008 at 12:45 UTC
    Thanks guys. The problem is that I am unable to find the error message :-( I will try the Encode module.