Steve_BZ has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I seem to have a strange problem with my Portuguese characters on a UK-build Kubuntu 9.10 machine with a Firebird database. This is how it goes:

my $a = "Identificação do paciente";print $a,"\n";
(and if you don't have the relevant font on your machine, that first word ends '...c-cedilla, a-tilde, o'; it displays nicely everywhere on this machine). The string is 25 characters long. However, if I pick the word up from a database I get problems:
my $a = __t("Patient Id");print $a,"\n";
, then I get IdentificaÃ§Ã£o do paciente, the first word of which ends with five characters instead of three. Each accented character is replaced by two characters: '... A-tilde, paragraph-character, A-tilde, pound-sign, o'. The string is 27 characters long, and when I make it a hash key the hash doesn't work. The values come out as '...Next Page...' (which I think comes from my Komodo IDE trying to deal with whatever is in the key), whatever their real contents.
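The 25-versus-27 count is what you would expect if UTF-8 bytes are being read one byte per character. A minimal sketch that reproduces the symptom without a database, assuming the stored text is UTF-8 and the reader treats it as Latin-1 (the variable names are illustrative only):

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The intended text, written with escapes so the example does not
# depend on how this file itself is encoded.
my $text  = "Identifica\x{E7}\x{E3}o do paciente";   # 25 characters
my $bytes = encode("UTF-8", $text);                  # 27 bytes on the wire

# Reading those bytes as Latin-1 turns every byte into one character.
my $misread = decode("ISO-8859-1", $bytes);

binmode STDOUT, ":encoding(UTF-8)";
printf "%d characters: %s\n", length($text),    $text;
printf "%d characters: %s\n", length($misread), $misread;
```

Each accented letter occupies two bytes in UTF-8, which accounts for the two extra characters and for the A-tilde that appears in front of each one.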

__t is a translate function which looks up the code on a Firebird database. The database is designed in UTF8 everywhere. When I look at it through Flamerobin or any other database utility it looks fine. I have used use utf8; at the beginning of every module and I'm using DBI with an Interbase driver.

my $loc_dsn = <<DSN;
dbi:InterBase:dbname=/home/DB/TEST.FDB;
ib_dialect=3;
DSN
$gl_dbh=DBI->connect($loc_dsn,"******","**********", { PrintError => 1, RaiseError => 1 } )
    or die "Can't connect to database" . DBI->errstr;

If you have any ideas or suggestions for debugging, then I'd be very grateful.

Regards

Steve.

Re: Do I have a unicode problem, or is this something else?
by graff (Chancellor) on Jun 10, 2010 at 02:21 UTC
    What ikegami is saying is that the data you get from your database is indeed utf8 data, but perl is unaware that this is the case, and its default behavior is to treat it as bytes (which end up, in your situation, as single-byte Latin-1 characters). So...
    binmode STDOUT, ":utf8";
    my $u = __t("Patient Id");
    utf8::decode( $u );
    print $u,"\n";
    Please check the utf8 manual for more details. In some situations, it might be preferable to use the Encode module:
    use Encode;
    binmode STDOUT, ":utf8";
    my $u = decode( "utf8", __t("Patient Id"));
    print $u;
    (updated to use "$u" instead of "$a" -- lexical instances of perl globals can lead to confusion and anxiety...)
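The two approaches differ in calling convention, which is easy to trip over: utf8::decode changes its argument in place and returns a success flag, while Encode's decode returns a new string. A small sketch of the difference, using a made-up byte string rather than a database value:

```perl
use strict;
use warnings;
use Encode ();

# "T\x{ED}tulo" encoded as 7 UTF-8 bytes (illustrative test data).
my $bytes = "T\xC3\xADtulo";

# utf8::decode works in place and returns true on success,
# not the decoded string.
my $in_place = $bytes;
my $ok       = utf8::decode($in_place);

# Encode::decode leaves its argument alone and returns a decoded copy.
my $copy = Encode::decode("UTF-8", $bytes);

binmode STDOUT, ":encoding(UTF-8)";
print "in place: $in_place (", length($in_place), " chars)\n";
print "copy:     $copy (", length($copy), " chars)\n";
```

Writing my $u = utf8::decode($x) would therefore store the success flag rather than the text.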

      Hi Graff,

      Thanks for this. So what I understand is the use utf8 that I have in my modules will just simplify any

      binmode STDOUT, ":utf8"; my $a = __t("Patient Id"); utf8::decode( $a ); print $a,"\n";
      to
      binmode STDOUT, ":utf8"; my $a = __t("Patient Id"); decode( $a ); print $a,"\n";

      Presumably I can also insert this code into __t() and not worry about putting it elsewhere.

      Thanks for this: very helpful.

      Have a good day.

      regards

      Steve

        So what I understand is the use utf8 that I have in my modules will just simplify any ... to ...

        If you think this is an enhancement -- and you have no other reason for use utf8 in your code -- I would consider it a false "advantage", especially if you need (now or in the future) to add use Encode to your script, since you will then have a clash in how the decode() function is defined.

        Did you notice this (rather prominent) passage in the perldoc "utf8" man page?

        Do not use this pragma for anything else than telling Perl that your script is written in UTF-8.

        (Italics added, bold in original.)

        utf8 doesn't export any functions, at least not by default. Your second snippet doesn't run.

        $ perl -e'use utf8; decode($_)'
        Undefined subroutine &main::decode called at -e line 1.
Re: Do I have a unicode problem, or is this something else?
by ikegami (Patriarch) on Jun 10, 2010 at 01:26 UTC

    The database is designed in UTF8 everywhere.

    And that appears to be what you got. The DBDs I've seen return text still encoded unless you tell them otherwise (sqlite_unicode=>1, for example).

    I don't see anything relevant in DBD::InterBase, so it looks like it'll be up to you to decode what you get back from the DB.
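One way to do that decoding is once, at the point where rows come back from the driver. A minimal sketch, assuming the driver hands back raw UTF-8 bytes as this thread suggests; decode_fields is a hypothetical helper and the row is simulated rather than fetched:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode every defined field of a fetched row
# from UTF-8 bytes into Perl character strings. NULL columns (undef)
# pass through unchanged.
sub decode_fields {
    return map { defined $_ ? decode("UTF-8", $_) : undef } @_;
}

# Simulated result of $sth->fetchrow_array: raw bytes plus a NULL.
my @raw_row = ("Patient Id", "Identifica\xC3\xA7\xC3\xA3o do paciente", undef);
my @row     = decode_fields(@raw_row);

# Once decoded, the text behaves as expected as a hash key.
my %label_for = ($row[1] => "patient identification");

binmode STDOUT, ":encoding(UTF-8)";
print length($row[1]), " characters: $row[1]\n";
```

Decoding at the boundary keeps the rest of the program working with character strings only, so nothing further downstream needs to know how the database stores its text.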

    I have used use utf8; at the beginning of every module

    That simply indicates the source is encoded using UTF-8. That's not relevant here.
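A short sketch of what the pragma does and does not affect, assuming this file is itself saved as UTF-8. It only changes how string literals in the source are parsed; data arriving at run time is untouched:

```perl
use strict;
use warnings;

my ($with, $without);
{
    use utf8;    # literals in this block are parsed as UTF-8 characters
    $with = length "ção";
}
{
    no utf8;     # the same literal is read as raw bytes
    $without = length "ção";
}

print "with use utf8: $with, without: $without\n";
```

Under that assumption the lengths come out as 3 and 5, because the c-cedilla and a-tilde each occupy two bytes. A string returned from the database behaves like the second case whether or not the module says use utf8.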

      Hi ikegami,

      Thanks for that. I looked at the Programming for DBI manual, and there isn't even a section for Unicode in the whole manual, let alone for InterBase! So that alarmed me a bit.

      Have a good day.

      Regards

      Steve

Re: Do I have a unicode problem, or is this something else?
by Steve_BZ (Chaplain) on Jun 10, 2010 at 17:43 UTC

    Hi All,

    So that went very smoothly. Thanks to ikegami and Graff for pointing me in the right direction. However it also revealed that I have a similar problem with my file IO.

    I have used the Wx::RichTextCtrl SaveFile() command, which saves in XML. Characters with accents are saved in what looks like an octal format (e.g. Title, or Título, is saved as T& # 2 3 7 ;tulo, without the spaces). I tried use open ':encoding(utf8)'; which I thought would solve all my problems, but it didn't. I guess maybe & # 2 3 7 is not utf8; it doesn't look the same. Does anyone know what it is and how I should deal with it?

    Regards

    Steve.

      Unicode character 237 (decimal, not octal) = U+00ED = LATIN SMALL LETTER I WITH ACUTE = what you want = no problem.

        Hi ikegami,

        Thanks for that. So I understand that this is a decimal code, although I'm not sure what U+00ED means.

        a) Is there a function like the decode function which will parse a variable and replace these strings with the correct unicode characters?

        b) What is this style of encoding called, so I can do a google on it?

        Regards

        Steve
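On both questions: the &#237; form is an XML/HTML numeric character reference, and U+00ED is the same code point written in hexadecimal (ED hex is 237 decimal). An XML parser normally expands these on reading. As a rough sketch for plain strings, using a hypothetical helper name and a single substitution so that nothing is decoded twice:

```perl
use strict;
use warnings;

# Hypothetical helper: replace decimal (&#237;) and hexadecimal (&#xED;)
# numeric character references with the characters they name.
sub decode_numeric_refs {
    my ($text) = @_;
    $text =~ s/&#(?:x([0-9A-Fa-f]+)|([0-9]+));/chr(defined $1 ? hex($1) : $2)/ge;
    return $text;
}

my $saved = "T&#237;tulo";
my $title = decode_numeric_refs($saved);

binmode STDOUT, ":encoding(UTF-8)";
printf "%s -> %s (U+%04X)\n", $saved, $title, ord(substr($title, 1, 1));
```

This handles only numeric references. Named entities such as &amp; need a real parser, or something like decode_entities from the CPAN module HTML::Entities.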