KarthikK has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,
I have a Perl script that runs under RATLPERL (from IBM Rational), which is at version 5.8.6.

The script does the following: it takes each record using APIs provided by IBM and builds an XML document. I use the XML::Smart parser for this task.

The output encoding of the XML is UTF-8 (<?xml version="1.0" encoding="utf-8" ?>).

I have all sorts of special characters like ü ä ö µ € ² ³. Since the output XML has to be UTF-8, I use Unicode::UTF8simple from CPAN. The reason was that the initial version of RATLPERL, which was on Perl 5.6.1, didn't have full support for UTF-8/Unicode.

So this is what I do. When I insert the data into tables, I use this:

    # Convert from UTF-8 (input XML)
    use Unicode::UTF8simple;
    my $uref = new Unicode::UTF8simple;
    my $str_var = $uref->fromUTF8("windows-1256", <XMLNODE>);
    # $str_var goes to the database column

While getting it back from the database to put into the XML, I use:

    # Convert to UTF-8 (output XML)
    use Unicode::UTF8simple;
    my $uref = new Unicode::UTF8simple;
    my $strvar = $uref->toUTF8("windows-1256", <XMLNODE>);


The above code does not always work. Sometimes it throws an error which I am unable to trace.

Unicode::UTF8simple was mainly created to support UTF-8 on Perl 5.00 through 5.6 (according to its CPAN description).

Since we have to use the new version of RATLPERL, which is on 5.8.6, I was wondering whether there is a better way of converting my database values to UTF-8 and vice versa.

I am running this script under Windows 2003 & XP.

Thanks a lot for your time

KK

Replies are listed 'Best First'.
Re: UTF in Perl
by gone2015 (Deacon) on Oct 10, 2008 at 09:18 UTC

    For 5.8.6 I suggest the core Encode module instead of the old Unicode::UTF8simple.

Re: UTF in Perl
by moritz (Cardinal) on Oct 10, 2008 at 12:00 UTC
    Sometimes it throws error which i am unable to trace

    And what's the text of the error message?

    I agree with oshalla that Encode is the way to go. You simply decode($encoding, $string) every string that comes from the outside, and encode($encoding, $string) everything before you write it out. (Or, if your XML module is smart enough, you just pass the decoded string to the module.)

    See also Character encodings and perl, perluniintro, perlunifaq, perlunicode, Encode.
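The decode-on-input, encode-on-output rule can be sketched as a round trip. This is only an illustration, not the poster's actual setup: the sample value and the choice of cp1252 as the database code page are assumptions, and the real code page has to be whatever the database column actually uses.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical bytes as they might come out of a cp1252 database column:
# "Größe 5 µm €" with ö = 0xF6, ß = 0xDF, µ = 0xB5, € = 0x80.
my $db_bytes = "Gr\xF6\xDFe 5 \xB5m \x80";

# Input boundary: bytes -> Perl character string.
my $chars = decode("cp1252", $db_bytes);

# Output boundary: character string -> UTF-8 bytes for the XML file.
my $xml_bytes = encode("UTF-8", $chars);

# Reverse direction: UTF-8 from the XML -> characters -> database bytes.
my $chars_again = decode("UTF-8", $xml_bytes);
my $db_again    = encode("cp1252", $chars_again);
```

Between the two boundaries the program only handles character strings, so the same code works whatever the external encodings are.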

      Somehow encode and decode don't work!


      My code (snippet) looks like this:

          use XML::SMART;
          use Encode;

          my $XML = XML::Smart->new(q`<?xml version="1.0" encoding="UTF-8" ?>
          <MSR-ISSUE xmlns="" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="my.xsd">
          </MSR-ISSUE>`, 'XML::Smart::Parser');

          my $test_in_incoming_xml = "Auch wenn man es nach Jahren guter Beschäftigung kaum verstehen kann";
          my $decoded_string = decode("utf8", $test_in_incoming_xml);

          $XML->{'MSR-ISSUE'}{'SHORT-NAME'}->content(0,$decoded_string);

          my $xmlfile = "C:\\Temp\\TestFile.xml";
          $XML->save($xmlfile, nometagen => 1, forceutf8 => 1);
      The output under the SHORT-NAME node of my XML looks like this: "Auch wenn man es nach Jahren guter Besch&#65533;ftigung kaum verstehen kan"
      The XML seems to be valid with utf8 encoding.
      Let's say I want to read this XML again and insert it into a database field. When I use encode, it doesn't seem to work:

          #...xml handling...
          my $XMLRead = XML::Smart->new("C:\\Temp\\TestFile.xml", 'XML::Smart::Parser');
          my $sendername = $XMLRead->{'MSR-ISSUE'}{'SHORT-NAME'};
          my $encoded_string = encode("cp1250", $sendername);
          print $encoded_string;

      This prints nothing. When I append static text to the variable $sendername, the script prints as shown below:

          #...xml init...
          my $sendername = $XMLRead->{'MSR-ISSUE'}{'SHORT-NAME'};
          $sendername = $sendername . " Test";
          my $encoded_string = encode("cp1250", $sendername);
          print $encoded_string;
      The output is printed like this
      Auch wenn man es nach Jahren guter Besch?ftigung kaum verstehen kann Test
      Basically, I would expect the umlaut back after encoding. Am I doing anything wrong here?
      Kindly help me
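One plausible explanation for the &#65533; and the later "?" can be reproduced in isolation. The sketch below assumes the script file was saved in a single-byte Windows code page, so the literal "ä" reaches Perl as the byte 0xE4; that is a guess about the poster's environment, and cp1252 is used here only as an example of such a code page.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: "ä" arrives as the single byte 0xE4, not as UTF-8 (0xC3 0xA4).
my $raw = "Besch\xE4ftigung";

# 0xE4 followed by "f" is not valid UTF-8, so decode substitutes the
# replacement character U+FFFD (decimal 65533).
my $misdecoded = decode("utf8", $raw);

# U+FFFD has no cp1250 equivalent, so encode falls back to "?".
my $lossy = encode("cp1250", $misdecoded);

# Decoding with the encoding the bytes are really in keeps the umlaut.
my $chars        = decode("cp1252", $raw);
my $utf8_bytes   = encode("UTF-8", $chars);
my $cp1250_bytes = encode("cp1250", $chars);
```

Under that assumption the umlaut is already lost at the decode step, before XML::Smart is involved, and no later encode call can bring it back.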
        my code(snippet) looks like this
        use XML::SMART;

        Where did you get XML::SMART from? I can't find that on CPAN, only XML::Smart.

        The XML seems to be valid with utf8 encoding.

        Funny, when I run your script it prints the ä in Latin-1, which is a bug in the module (IMHO).

        This works for me (i.e. it produces a valid UTF-8 XML file; source file stored in UTF-8):

            use XML::Smart;
            use Encode;

            my $XML = XML::Smart->new(q`<?xml version="1.0" encoding="UTF-8" ?>
            <MSR-ISSUE xmlns="" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="my.xsd">
            </MSR-ISSUE>`, 'XML::Smart::Parser');

            my $test_in_incoming_xml = "Auch wenn man es nach Jahren guter Beschäftigung kaum verstehen kann";
            utf8::upgrade($test_in_incoming_xml);

            $XML->{'MSR-ISSUE'}{'SHORT-NAME'}->content(0,$test_in_incoming_xml);

            my $xmlfile = "foo.xml";
            $XML->save($xmlfile, nometagen => 1, forceutf8 => 1);

        I'm pretty sure that the utf8::upgrade line is fundamentally wrong, and compensates for an XML::Smart bug.
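A small check illustrates why utf8::upgrade should not be needed in correct code: it changes only Perl's internal storage of a string, not the characters it contains. The variable names are made up for this sketch.

```perl
use strict;
use warnings;

my $original = "\xE4";          # one character, U+00E4 (ä), stored as one byte
my $upgraded = $original;
utf8::upgrade($upgraded);       # same character, now stored internally as UTF-8

my $same_text   = ($original eq $upgraded)    ? 1 : 0;   # compares characters
my $flag_before = utf8::is_utf8($original)    ? 1 : 0;   # internal flag only
my $flag_after  = utf8::is_utf8($upgraded)    ? 1 : 0;
```

The two strings compare equal and both have length 1; only the internal flag differs. A module whose output changes depending on that flag is looking at the internal representation, which fits the suspicion of a bug in the module.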

      Thanks guys. The problem is that I am unable to find the error message :-( I will try the Encode module.
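With Encode, one way to make a failing conversion report itself is to pass a strict CHECK value and catch the exception. This is a sketch under assumptions: the sample text is invented, and cp1250 is used because it appears earlier in the thread. As far as I know cp1250 has no code point for ³, which would make it a poor target for the character list in the original question.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical text containing "³" (U+00B3).
my $text = "2\x{00B3}";

# FB_CROAK makes encode die on the first unmappable character instead of
# silently substituting "?"; LEAVE_SRC keeps $text unmodified.
my $bytes = eval {
    encode("cp1250", $text, Encode::FB_CROAK | Encode::LEAVE_SRC);
};
my $error = $@;

# Report which character could not be converted.
warn "Conversion failed: $error" if $error;
```

The message captured in $error names the offending code point and the target encoding, which is usually enough to trace the record that caused it.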