
I've been handling this sort of thing a lot with Perl 5.8's Encode module. Getting the Unicode interpretation of data to work properly with other modules (e.g. DBI) may take some experimentation, but in a case like yours, the first thing I would try is something like this:

use Encode;
use DBI;  # (or whatever you use for your PostgreSQL)

my $dbstr; # suppose this holds "unicode" data from the DB

# ... do whatever it takes to fetch a value into $dbstr;
# since DBI might not be "unicode-aware", you may need to
# coerce perl into treating the value as unicode:

my $unistr = decode( 'utf8', $dbstr );
my $latin1str = encode( 'iso-8859-1', $unistr );

print $latin1str;

Now, among the things that could go wrong:

1. the database actually has utf16 (BE or LE), not utf8
2. there are unicode values in the database that are outside the Latin1 range
3. you don't have perl 5.8, and can't install it for some reason

For the first, if you can dump the relevant "raw" database content to a file, use a hex-mode viewer on that file to see which variant of Unicode you're dealing with. For example, \x{00c0} (A-grave) would show up as one of the following byte sequences: "00 c0" (UTF-16BE), "c0 00" (UTF-16LE), or "c3 80" (UTF-8). With Perl 5.8, just pass the matching encoding name ('UTF-16BE', 'UTF-16LE' or 'utf8') as the first arg to decode().
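
If a hex viewer isn't handy, Perl itself can show the bytes. This is only a rough sketch: hex_bytes() is a name I made up for illustration, and the sample strings are the A-grave byte sequences listed above, not real database content.

```perl
use strict;
use warnings;

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($raw) = @_;
    return join ' ', map { sprintf '%02x', $_ } unpack 'C*', $raw;
}

# A-grave (U+00C0) as it would be stored in each candidate encoding
my %sample = (
    'UTF-16BE' => "\x00\xc0",
    'UTF-16LE' => "\xc0\x00",
    'utf8'     => "\xc3\x80",
);

for my $name ( sort keys %sample ) {
    printf "%-8s %s\n", $name, hex_bytes( $sample{$name} );
}
```

Run hex_bytes() on a value fetched from the DB and compare the pattern around any accented character against the three samples.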

For the second point, Encode's default behavior will be to insert "?" for characters that can't be coerced into the desired character set -- watch out for question marks in your output.
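
If you would rather catch those cases than hunt for question marks afterwards, encode() takes an optional third "check" argument. Here is a minimal sketch, assuming Perl 5.8 with Encode; to_latin1_strict() is a hypothetical name of mine, not something from Encode or DBI.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: returns the Latin1 bytes, or undef if any
# character in $unistr has no iso-8859-1 equivalent.
sub to_latin1_strict {
    my ($unistr) = @_;
    my $copy = $unistr;    # encode() with a check value may modify its argument
    # FB_CROAK makes encode() die on an unmappable character;
    # eval catches that and leaves $bytes undefined.
    my $bytes = eval { encode( 'iso-8859-1', $copy, Encode::FB_CROAK() ) };
    return $bytes;
}

my $ok  = to_latin1_strict("caf\x{e9}");         # e-acute fits in Latin1
my $bad = to_latin1_strict("smile \x{263a}");    # U+263A does not

print defined $ok  ? "first string converted\n"  : "first string failed\n";
print defined $bad ? "second string converted\n" : "second string failed\n";
```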

For the third case, if you really are just dealing with Latin1 characters, and your DB holds utf16 data, then the easiest thing is to just remove the null bytes (s/\x0//g;), and the result will be a "pure" latin1 string. If it's utf8 and all else fails, you could just do the necessary bit-shifting to arrive at the corresponding 8859-1 characters -- e.g. this would do it:

# snippet to convert utf8 to latin1 -- NB: only works for utf8
# characters that correlate to unicode \x{0000} - \x{00ff}
# (and you really should figure out how to convert using a module)

my @bytes = unpack 'C*', $_; # break utf8 string into bytes
$_ = '';

while ( @bytes ) {
    my $byte = shift @bytes;
    if ( $byte & 0x80 ) { # start of utf8 (latin1) character
        # 1st utf8 byte carries the top 2 latin1 bits
        my $top = ( $byte & 3 ) << 6;
        # 2nd byte has the other 6 bits
        $_ .= chr( $top | ( shift( @bytes ) & 0x3f ) );
    } else {
        $_ .= chr( $byte );  # utf8 ascii is just ascii.
    }
}

# now $_ holds latin1 (single-byte, iso-8859-1) characters
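
For the utf16 variant of the third case, a slightly more cautious alternative to stripping null bytes is to unpack the 16-bit units and refuse anything above \x{00ff}. This is only a sketch that needs no Encode: utf16_to_latin1() is a hypothetical name, and it assumes the data has no byte-order mark and that you already know the byte order.

```perl
use strict;
use warnings;

# Hypothetical helper: $raw holds utf16 bytes without a byte-order mark.
# Returns the latin1 string, or undef if any character is outside latin1.
sub utf16_to_latin1 {
    my ( $raw, $big_endian ) = @_;
    # 'n' reads 16-bit big-endian units, 'v' reads little-endian ones
    my @units = unpack( $big_endian ? 'n*' : 'v*', $raw );
    return if grep { $_ > 0xff } @units;    # not representable in latin1
    return pack 'C*', @units;               # one byte per character
}

my $latin1 = utf16_to_latin1( "\x00\xc0\x00A", 1 );    # A-grave, then "A"
print defined $latin1 ? length($latin1) . " latin1 bytes\n" : "not latin1\n";
```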

(update: added a bit more commentary to the "kluged" utf8-to-latin1 conversion)

In reply to Re: Encoding of DBI PostgreSQL output by graff
in thread Encoding of DBI PostgreSQL output by Kjetil
