Re: Re: Encoding of DBI PostgreSQL output

This is getting closer!

Indeed, I'm getting question marks now, so it seems the characters really are out of range somehow.

But that also makes it harder to understand, because these should be the standard Norwegian characters, all of which are in Latin1. I could tolerate a ? now and then for characters that aren't in Latin1, as long as they weren't Norwegian.

What could they be, then?

Messages come in via e-mail, encoded as iso8859-1, quoted-printable.

In my initial Perl script, they are decoded using MIME::QuotedPrint. The strings are then inserted into the DB by DBI.
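To make the byte-level picture concrete, here is a minimal sketch of that decoding step. The sample string is hypothetical, the DBI insert itself is left out, and the choice of UTF-8 as the target encoding is an assumption (a Latin1 database would want 'iso-8859-1' instead). It only shows that decode_qp returns raw Latin1 bytes, and that those bytes change if they are re-encoded as UTF-8 somewhere along the way.

```perl
use strict;
use warnings;
use MIME::QuotedPrint qw(decode_qp);
use Encode qw(decode encode);

# Hypothetical sample, as it might appear in a quoted-printable mail body.
my $qp_line = "bl=F8tkake";

# decode_qp returns raw bytes: =F8 becomes the single Latin1 byte 0xF8.
my $latin1_bytes = decode_qp($qp_line);

# Tell Perl which encoding those bytes are in, to get a character string.
my $chars = decode('iso-8859-1', $latin1_bytes);

# Encode explicitly for the database connection (UTF-8 is an assumption).
my $utf8_bytes = encode('UTF-8', $chars);

printf "latin1: %s\n", join ' ', map { sprintf '%02x', $_ } unpack 'C*', $latin1_bytes;
printf "utf8:   %s\n", join ' ', map { sprintf '%02x', $_ } unpack 'C*', $utf8_bytes;
```

In the Latin1 dump the letter is one byte (f8); in the UTF-8 dump it is two (c3 b8).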

If I use the psql client, the letters come out right, but if I print them they come out as e.g. Ã|.

The two characters Ã| imply UTF8, don't they?

I tried to encode with UTF16, but it resulted in errors like:

 
UTF-16BE:Partial character at /usr/lib/perl5/5.8.0/i386-linux-thread-multi/Encode.pm line 156.

But I guess that's a sign it is not UTF16. This happened for LE too.

So, I guess what this means is that it is UTF8, but for some reason the normal Norwegian characters are now outside the range of Latin1. I've seen ø QP-encoded as =F8, and that corresponds to its hex value in Latin1. But apparently something happens in the database at some point.
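For reference, a small sketch of how the three Norwegian letters look in each encoding, and what they turn into when UTF-8 bytes are displayed as if they were Latin1. The helper name misread_as_latin1 is made up for illustration, and the binmode line assumes a UTF-8 terminal. Whether this is what actually happens to the data here is only a guess.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

binmode STDOUT, ':encoding(UTF-8)';   # assumes a UTF-8 terminal

# Hypothetical helper: encode one character as UTF-8, then read those
# bytes back as if they were Latin1 (the classic two-character garbage).
sub misread_as_latin1 {
    my ($char) = @_;
    return decode('iso-8859-1', encode('UTF-8', $char));
}

my @letters = ("\x{e6}", "\x{f8}", "\x{e5}");   # ae-ligature, o-slash, a-ring

for my $char (@letters) {
    printf "%s  latin1=%s  utf8=%s  misread=%s\n",
        $char,
        unpack('H*', encode('iso-8859-1', $char)),   # one byte
        unpack('H*', encode('UTF-8', $char)),        # two bytes, first is c3
        misread_as_latin1($char);
}
```

Each letter is a single byte in Latin1 (e6, f8, e5) and two bytes in UTF-8 (c3 a6, c3 b8, c3 a5). Read as Latin1, the UTF-8 form of æ is an A-tilde followed by a broken bar (byte a6), which is easy to mistake for a plain vertical bar on screen.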

I couldn't find a hex tool here now, but I'll look for it.

Thanks a lot for the help, and more suggestions are always very welcome!


Replies are listed 'Best First'.

Re: Re: Re: Encoding of DBI PostgreSQL output
by graff (Chancellor) on May 21, 2003 at 16:15 UTC

The two characters Ã| imply UTF8, don't they?

Well, that's not clear... It's closer to utf8 than it is to anything else I'm aware of, but the second character you have posted there is a plain-ascii "vertical bar", \x7c, which in combination with the initial A-tilde (\xC3) constitutes an invalid, unusable byte sequence for utf8. That sort of problem would certainly explain the presence of a "?" when you try to convert this to latin1.
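A rough way to test this with the Encode module, as a sketch only: the helper names is_valid_utf8 and utf8_to_latin1 are made up for illustration. The first does a strict decode that dies on malformed input; the second does a lenient decode followed by a conversion to latin1, which is where a "?" can appear.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK);

# Hypothetical helper: true if $bytes is a well-formed UTF-8 byte sequence.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;   # decode may modify its input when a CHECK is given
    return eval { decode('UTF-8', $copy, FB_CROAK); 1 } ? 1 : 0;
}

# Hypothetical helper: lenient UTF-8 decode, then convert to latin1 bytes.
sub utf8_to_latin1 {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);    # malformed input becomes U+FFFD
    return encode('iso-8859-1', $chars);    # unmappable characters become "?"
}

for my $bytes ("\xC3\x7C", "\xC3\xA6") {
    printf "%s: %s, as latin1: %s\n",
        unpack('H*', $bytes),
        is_valid_utf8($bytes) ? 'valid' : 'invalid',
        unpack('H*', utf8_to_latin1($bytes));
}
```

The pair c3 7c is rejected by the strict decode, while c3 a6 (a real two-byte utf8 character) is accepted and converts to the single latin1 byte e6.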

I couldn't find a hex tool here now, but I'll look for it.

Sounds like you really need one. All unix/linux systems have "od" (and GNU and others have MSwindows-ported versions); naturally, Perl can be used to provide this facility as well:

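my @bytes = unpack "C*", $_;   # break utf8 string into bytes
for (my $i = 0; $i < @bytes; $i += 8) {   # print 8 bytes per row
    my $j = ($i+7 < $#bytes) ? $i+7 : $#bytes;   # last index of this row
    print join(" ", map {sprintf "%.2x", $bytes[$_]} $i .. $j), "\n";
}

(That's a real kluge, but good enough to start with.)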

If, as seems possible, your DB entries contain corrupted utf8 character data, you'll need to diagnose the problems, patch them, and update the tables as needed -- you should be able to reconstruct the intended characters to replace mangled ones, based on context. Good luck with that.
