in reply to Encoding of DBI PostgreSQL output

I've been handling this sort of thing a lot using perl 5.8's Encode module; getting the unicode interpretation of data to work properly with other modules (e.g. DBI) may take some experimentation, but in a case like yours, I think the first thing I would try would go something like this:
use Encode;
use DBI;    # (or whatever you use for your PostgreSQL)

my $dbstr;  # suppose this holds "unicode" data from the DB
# ... do whatever it takes to fetch a value into $dbstr;

# since DBI might not be "unicode-aware", you may need to
# coerce perl into treating the value as unicode:
my $unistr    = decode( 'utf8', $dbstr );
my $latin1str = encode( 'iso-8859-1', $unistr );
print $latin1str;
Now, among the things that could go wrong are:

  1. the data coming out of the DB may not actually be utf8 -- it could be
     some other unicode serialization, such as utf16 (in either byte order);
  2. the data may contain characters that have no equivalent in the target
     character set (here, iso-8859-1);
  3. the module-based conversion may simply not pan out, and you'll need a
     more manual fallback.

For the first point: dump the relevant "raw" database content to a file and use a hex-mode viewer on it to see which variant of unicode you're dealing with. For example, \x{00c0} (A-grave) would show up as one of the following byte sequences: "00 c0" (utf16BE), "c0 00" (utf16LE), or "c3 80" (utf8). With perl 5.8, just put the appropriate choice as the first arg to "decode()".
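If perl 5.8 and Encode are already installed, you can generate those reference byte sequences yourself instead of memorizing them -- a quick sketch that serializes A-grave under several encodings and hex-dumps each result:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Print the byte sequence for A-grave (U+00C0) under several encodings,
# so you know what to look for in the hex dump of your DB data.
my $char = "\x{00c0}";
for my $enc ( 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'iso-8859-1' ) {
    my $octets = encode( $enc, $char );
    printf "%-10s %s\n", $enc,
        join( " ", map { sprintf "%.2x", $_ } unpack( "C*", $octets ) );
}
# UTF-8      c3 80
# UTF-16BE   00 c0
# UTF-16LE   c0 00
# iso-8859-1 c0
```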

For the second point, Encode's default behavior is to insert "?" for characters that can't be coerced into the desired character set -- so watch out for question marks in your output.
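A small sketch of that behavior, along with Encode's FB_CROAK check flag, which makes the conversion die instead of silently substituting:

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK);

# U+0142 (l-stroke) has no latin1 equivalent, so the default
# fallback substitutes a question mark:
my $uni = "z\x{0142}otych";
print encode( 'iso-8859-1', $uni ), "\n";    # the l-stroke becomes "?"

# Passing FB_CROAK as the CHECK argument makes encode() die instead,
# so the problem can't slip by unnoticed.  (Note: with a CHECK flag,
# encode may modify its source argument, hence the copy.)
my $copy = $uni;
my $ok = eval { encode( 'iso-8859-1', $copy, FB_CROAK ); 1 };
print "conversion failed: $@" unless $ok;
```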

For the third case, if you really are just dealing with Latin1 characters and your DB holds utf16 data, the easiest thing is to just remove the null bytes (s/\x00//g;), and the result will be a "pure" latin1 string. If it's utf8 and all else fails, you could do the necessary bit-shifting yourself to arrive at the corresponding 8859-1 characters -- e.g. this would do it:

# snippet to convert utf8 to latin1 -- NB: only works for utf8
# characters that correlate to unicode \x{0000} - \x{00ff}
# (and you really should figure out how to convert using a module)
my @bytes = unpack "C*", $_;    # break utf8 string into bytes
$_ = '';
while ( @bytes ) {
    my $b = shift @bytes;
    if ( $b & 0x80 ) {          # start of a two-byte utf8 (latin1) character
        my $c = ( $b & 3 ) << 6;                       # 1st utf8 byte carries top 2 latin1 bits
        $_ .= chr( $c | ( shift( @bytes ) & 0x3f ) );  # 2nd byte has the other 6 bits
    }
    else {
        $_ .= chr( $b );        # utf8 ascii is just ascii
    }
}
# now $_ holds latin1 (single-byte, iso-8859-1) characters
(update: added a bit more commentary to the "kluged" utf8-to-latin1 conversion)
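For the record, here is what the module-based route alluded to in the snippet's comment looks like -- with perl 5.8's Encode, the whole utf8-to-latin1 conversion is a single from_to() call, which converts the octets in place:

```perl
use strict;
use warnings;
use Encode qw(from_to);

# "\xc3\x80" is the utf8 byte sequence for A-grave (U+00C0)
my $octets = "\xc3\x80 la mode";
from_to( $octets, 'utf8', 'iso-8859-1' );    # converts in place
print $octets, "\n";                         # now "\xc0 la mode" -- latin1
```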

Re: Re: Encoding of DBI PostgreSQL output
by Kjetil (Sexton) on May 21, 2003 at 11:25 UTC
    This is getting closer!

    Indeed, I'm getting question marks now, so it seems that the characters are indeed somewhat out of range.

    But it also makes it harder to understand, because the characters should be the standard Norwegian characters, all of which are in Latin1. If some out-of-range characters had crept in, I could tolerate a ? now and then -- as long as they weren't the Norwegian ones.

    What could they be, then....?

    Messages come in via e-mail, then encoded as iso8859-1, quoted-printable.

    In my initial perl-script, they are decoded using MIME::QuotedPrint. The strings are then inserted into the DB by DBI.

    If I use the psql client, the letters come out right, but if I print them they come out as e.g. Ã|.

    The two characters Ã| imply UTF8, don't they?

    I tried to encode with UTF16, but it resulted in errors like:

    UTF-16BE:Partial character at /usr/lib/perl5/5.8.0/i386-linux-thread-multi/Encode.pm line 156.

    But I guess that's a sign it is not UTF16... The same happened for LE too.

    So I guess what this means is that it is UTF8, but for some reason the normal Norwegian characters are now outside the range of Latin1. I've seen ø QP-encoded as =F8, which corresponds to its hex value in Latin1. But apparently something happens in the database at some point.
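    For what it's worth, here is a sketch of one mechanism that would produce that kind of mangling -- latin1 octets getting decoded as latin1 and re-encoded as utf8 somewhere between the script and the table (just a guess at where the extra byte comes from):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# ø is the single octet \xF8 in latin1.  If some layer treats that
# octet as a latin1 character and re-encodes it as utf8, it becomes
# the two-byte sequence C3 B8 -- an A-tilde followed by another
# non-ascii byte, much like the "A-tilde plus junk" in my output.
my $latin1 = "\xf8";
my $utf8   = encode( 'utf8', decode( 'iso-8859-1', $latin1 ) );
printf "%s\n", join( " ", map { sprintf "%.2x", $_ } unpack( "C*", $utf8 ) );
# prints: c3 b8
```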

    I couldn't find a hex tool here now, but I'll look for it.

    Thanks a lot for the help, and more suggestions are always very welcome!

      The two characters Ã| imply UTF8, don't they?

      Well, that's not clear... It's closer to utf8 than it is to anything else I'm aware of, but the second character you have posted there is a plain-ascii "vertical bar", \x7c, which in combination with the initial A-tilde (\xC3) constitutes an invalid, unusable byte sequence for utf8. That sort of problem would certainly explain the presence of a "?" when you try to convert this to latin1.
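      One quick way to check whether a string of octets is well-formed utf8 at all is to decode it with Encode's FB_CROAK check flag, which dies at the first malformed sequence (a sketch):

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# \xC3 must be followed by a continuation byte in \x80-\xBF; a plain
# vertical bar (\x7C) after it makes the sequence malformed utf8.
my $octets = "\xc3\x7c";
my $ok = eval { decode( 'utf8', $octets, FB_CROAK ); 1 };
print $ok ? "well-formed utf8\n" : "malformed utf8: $@";
```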

      I couldn't find a hex tool here now, but I'll look for it.

      Sounds like you really need one. All unix/linux systems have "od" (and GNU and others have MSwindows-ported versions); naturally, Perl can be used to provide this facility as well:

      @bytes = unpack "C*", $_;    # break the string into bytes
      for ( $i = 0; $i < @bytes; $i += 8 ) {
          $j = ( $i + 7 < $#bytes ) ? $i + 7 : $#bytes;
          print join( " ", map { sprintf "%.2x", $bytes[$_] } $i .. $j ), $/;
      }
      (That's a real kluge, but good enough to start with.)

      If, as seems possible, your DB entries contain corrupted utf8 character data, you'll need to diagnose the problems, patch them, and update the tables as needed -- you should be able to reconstruct the intended characters to replace mangled ones, based on context. Good luck with that.