Do I have a unicode problem, or is this something else?

Steve_BZ has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I seem to have a strange problem with my Portuguese characters on a UK-build Kubuntu 9.10 machine with a Firebird database. This is how it goes:

    
my $a = "Identificação do paciente";print $a,"\n";

(and if you don't have the relevant font on your machine, that first word ends '...c-cedilla, a-tilde, o'; it displays nicely everywhere on this machine). The string is 25 characters long. However, if I pick the word up from a database I get problems:

    my $a = __t("Patient Id");print $a,"\n";

Then I get IdentificaÃ§Ã£o do paciente, the first word of which ends with five characters instead of three. Each accented character is replaced by two characters: '... A-tilde, paragraph-character, A-tilde, pound-sign, o'. The string is 27 characters long, and when I make it a hash key the hash doesn't work. The values come out as '...Next Page...' (which I think comes from my Komodo IDE trying to deal with whatever is in the key), whatever their real contents.

__t is a translate function which looks up the code on a Firebird database. The database is designed in UTF8 everywhere. When I look at it through Flamerobin or any other database utility it looks fine. I have used use utf8; at the beginning of every module and I'm using DBI with an Interbase driver.

    my $loc_dsn = <<DSN;
dbi:InterBase:dbname=/home/DB/TEST.FDB;
ib_dialect=3;
DSN

        $gl_dbh=DBI->connect($loc_dsn,"******","**********", {
            PrintError => 1,
            RaiseError => 1
            }
        ) or die "Can't connect to database: " . $DBI::errstr;

If you have any ideas or suggestions for debugging, then I'd be very grateful.

Regards

Steve.

Re: Do I have a unicode problem, or is this something else? by graff (Chancellor) on Jun 10, 2010 at 02:21 UTC
What ikegami is saying is that the data you get from your database is indeed utf8 data, but perl is unaware that this is the case, and its default behavior is to treat it as bytes (which end up, in your situation, as single-byte Latin-1 characters). So...

    binmode STDOUT, ":utf8";
    my $u = __t("Patient Id");
    utf8::decode( $u );
    print $u,"\n";

Please check the utf8 manual for more details. In some situations, it might be preferable to use the Encode module:

    use Encode;
    binmode STDOUT, ":utf8";
    my $u = decode( "utf8", __t("Patient Id"));
    print $u;

(updated to use "$u" instead of "$a" -- lexical instances of perl globals can lead to confusion and anxiety...)
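As an illustration of the point above, here is a minimal sketch of why the same text measures 27 before decoding and 25 after, and why the undecoded value fails as a hash key. The `%label` hash and its `patient_id` value are made up for the example; the string is built from escapes so the result does not depend on how this file is saved.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Character string built from code points (U+00E7, U+00E3).
my $chars = "Identifica\x{E7}\x{E3}o do paciente";

# What a driver that does not decode hands back: the UTF-8 encoded bytes.
my $bytes = encode('UTF-8', $chars);

my $byte_len = length $bytes;                    # counts bytes
my $char_len = length decode('UTF-8', $bytes);   # counts characters

# The raw bytes and the decoded text are different strings, so they are
# different hash keys as well. (%label is hypothetical.)
my %label = ($chars => 'patient_id');
my $found_raw     = exists $label{$bytes}                    ? 1 : 0;
my $found_decoded = exists $label{ decode('UTF-8', $bytes) } ? 1 : 0;

print "bytes=$byte_len chars=$char_len raw=$found_raw decoded=$found_decoded\n";
```

Each of the two accented letters takes two bytes in UTF-8, which accounts for the two extra "characters" in the undecoded string.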
Re^2: Do I have a unicode problem, or is this something else? by Steve_BZ (Chaplain) on Jun 10, 2010 at 15:16 UTC
Hi Graff,

Thanks for this. So what I understand is the `use utf8` that I have in my modules will just simplify any

    binmode STDOUT, ":utf8";
    my $a = __t("Patient Id");
    utf8::decode( $a );
    print $a,"\n";

to

    binmode STDOUT, ":utf8";
    my $a = __t("Patient Id");
    decode( $a );
    print $a,"\n";

Presumably I can also insert this code into __t() and not worry about putting it elsewhere.

Thanks for this: very helpful. Have a good day.

regards

Steve
Re^3: Do I have a unicode problem, or is this something else? by graff (Chancellor) on Jun 10, 2010 at 16:56 UTC
"So what I understand is the `use utf8` that I have in my modules will just simplify any ... to ..."

If you think this is an enhancement -- and you have no other reason for `use utf8` in your code -- I would consider it a false "advantage", especially if you need (now or in the future) to add `use Encode` to your script, since you will then have a clash in how the `decode()` function is defined.

Did you notice this (rather prominent) passage in the perldoc "utf8" man page?

*Do not use this pragma for anything else than telling Perl that your script is written in UTF-8.*

(Italics added, bold in original.)
Re^4: Do I have a unicode problem, or is this something else? by Steve_BZ (Chaplain) on Jun 10, 2010 at 17:03 UTC
Re^5: Do I have a unicode problem, or is this something else? by ikegami (Patriarch) on Jun 10, 2010 at 17:35 UTC
Re^3: Do I have a unicode problem, or is this something else? by ikegami (Patriarch) on Jun 10, 2010 at 17:32 UTC
utf8 doesn't export any functions, at least not by default. Your second snippet doesn't run.

    $ perl -e'use utf8; decode($_)'
    Undefined subroutine &main::decode called at -e line 1.
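To make the distinction between the two decoders concrete, here is a small sketch. `utf8::decode` is defined by the Perl core and is callable by its full name without `use utf8`; it works in place and returns a success flag. `Encode::decode` returns a new string. The sample bytes and variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode ();   # empty import list: nothing is exported into main

# UTF-8 bytes for "Título" (T, U+00ED, t, u, l, o), written as escapes.
my $bytes = "T\xC3\xADtulo";

# Built-in function: converts its argument in place, returns true on success.
my $in_place = $bytes;
my $ok = utf8::decode($in_place);

# Encode's function: leaves its argument alone, returns the decoded text.
my $returned = Encode::decode('UTF-8', $bytes);

printf "ok=%d in_place=%d returned=%d original=%d\n",
    ($ok ? 1 : 0), length $in_place, length $returned, length $bytes;
```

Calling both by their fully qualified names, as here, avoids any question of which `decode` is meant.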
Re^4: Do I have a unicode problem, or is this something else? by Steve_BZ (Chaplain) on Jun 10, 2010 at 17:45 UTC
Re: Do I have a unicode problem, or is this something else? by ikegami (Patriarch) on Jun 10, 2010 at 01:26 UTC
"The database is designed in UTF8 everywhere."

And that appears to be what you got. The DBDs I've seen return text still encoded unless you tell them otherwise (`sqlite_unicode=>1`, for example). I don't see anything relevant in DBD::InterBase, so it looks like it'll be up to you to decode what you get back from the DB.

"I have used `use utf8;` at the beginning of every module"

That simply indicates the source is encoded using UTF-8. That's not relevant here.
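One way to "decode what you get back from the DB" is to do it once, inside the lookup function, so that every caller receives character strings. The sketch below assumes the driver returns UTF-8 bytes; `fetch_translation_bytes` is a hypothetical stand-in for the real database query, and `translate` is a hypothetical name playing the role of `__t`.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for the database lookup: returns UTF-8 bytes,
# as the driver in this thread appears to do.
sub fetch_translation_bytes {
    my ($code) = @_;
    my %rows = ('Patient Id' => "Identifica\xC3\xA7\xC3\xA3o do paciente");
    return $rows{$code};
}

# Decode once, at the boundary, so callers only ever see characters.
sub translate {
    my ($code) = @_;
    my $bytes = fetch_translation_bytes($code);
    return unless defined $bytes;      # unknown code: nothing to decode
    return decode('UTF-8', $bytes);
}

binmode STDOUT, ':encoding(UTF-8)';    # encode characters again on output
my $label = translate('Patient Id');
print $label, "\n";
```

With the decoding kept in one place, the rest of the program does not need to know how the driver delivers its strings.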
Re^2: Do I have a unicode problem, or is this something else? by Steve_BZ (Chaplain) on Jun 10, 2010 at 15:07 UTC
Hi ikegami, Thanks for that. I looked at the Programming for DBI manual, and there isn't even a section for unicode in the whole manual, let alone for InterBase! So that alarmed me a bit. Have a good day. Regards Steve
Re: Do I have a unicode problem, or is this something else? by Steve_BZ (Chaplain) on Jun 10, 2010 at 17:43 UTC
Hi All,

So that went very smoothly. Thanks to ikegami and Graff for pointing me in the right direction.

However it also revealed that I have a similar problem with my file IO. I have used the `Wx::RichTextCtrl SaveFile()` command, which saves in XML. Characters with accents are saved in what looks like an octal format (e.g. Title, or Título, is T& # 2 3 7 ;tulo without the spaces). I tried `use open ':encoding(utf8)';` which I thought would solve all my problems - it didn't. But I guess maybe & # 2 3 7 is not utf8. It doesn't look the same.

Does anyone know what it is and how I should deal with it?

Regards

Steve.
Re^2: Do I have a unicode problem, or is this something else? by ikegami (Patriarch) on Jun 10, 2010 at 18:20 UTC
Unicode character 237 (decimal, not octal) = U+00ED = LATIN SMALL LETTER I WITH ACUTE = what you want = no problem.
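For reference, this notation is an XML/HTML numeric character reference: the number between the ampersand-hash and the semicolon is the Unicode code point, in decimal, or in hexadecimal when prefixed with `x`. An XML parser resolves these on reading. If the raw text has to be handled by hand, a rough sketch using only core Perl could look like this (`decode_numeric_refs` is a name invented for the example, and named entities such as `&amp;` are not handled):

```perl
use strict;
use warnings;

# Replace decimal (&#237;) and hexadecimal (&#xED;) numeric character
# references with the characters they name, in a single pass.
sub decode_numeric_refs {
    my ($text) = @_;
    $text =~ s/&#(?:[xX]([0-9A-Fa-f]+)|(\d+));/chr(defined $1 ? hex($1) : $2)/ge;
    return $text;
}

my $title = decode_numeric_refs('T&#237;tulo');

binmode STDOUT, ':encoding(UTF-8)';
printf "%s: %d characters, second is U+%04X\n",
    $title, length $title, ord substr($title, 1, 1);
```

The CPAN module HTML::Entities offers `decode_entities`, which also covers named entities and may be the better choice outside a quick script.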
Re^3: Do I have a unicode problem, or is this something else? by Steve_BZ (Chaplain) on Jun 10, 2010 at 21:41 UTC
Hi ikegami, Thanks for that. So I understand that this is a decimal code, although I'm not sure what U+00ED means. a) Is there a function like the `decode` function which will parse a variable and replace these strings with the correct unicode characters? b) What is this style of encoding called, so I can do a google on it? Regards Steve
Re^4: Do I have a unicode problem, or is this something else? by ikegami (Patriarch) on Jun 10, 2010 at 23:05 UTC
Re^5: Do I have a unicode problem, or is this something else? by Steve_BZ (Chaplain) on Jun 11, 2010 at 18:01 UTC
Some notes below your chosen depth have not been shown here