in reply to Encoding of DBI PostgreSQL output
Now, among the things that could go wrong are:

    use Encode;
    use DBI;  # (or whatever you use for your PostgreSQL)

    my $dbstr;  # suppose this holds "unicode" data from the DB
    # ... do whatever it takes to fetch a value into $dbstr;

    # since DBI might not be "unicode-aware", you may need to
    # coerce perl into treating the value as unicode:
    my $unistr    = decode( 'utf8', $dbstr );
    my $latin1str = encode( 'iso-8859-1', $unistr );
    print $latin1str;
For the first, if you can dump the relevant "raw" database content to a file, use a hex-mode viewer on that file to see which variant of unicode you're dealing with. For example, \x{00c0} (A-grave) would show up as one of the following byte sequences: "00 c0" (utf16BE); "c0 00" (utf16LE); "c3 80" (utf8). With perl 5.8, just put the appropriate choice as the first arg to "decode()".
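If you want to see those byte sequences for yourself before looking at the real dump, something like the following sketch should do it. The helper name hex_bytes is made up for illustration; it just encodes a character with Encode and prints the bytes the way a hex viewer would.

```perl
use strict;
use warnings;
use Encode qw(encode);

# hex_bytes() is a hypothetical helper, not part of Encode:
# encode a character string and return its bytes as hex pairs.
sub hex_bytes {
    my ( $encoding, $chars ) = @_;
    my $bytes = encode( $encoding, $chars );
    return join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;
}

# U+00C0 (A-grave) in each of the three candidate encodings
for my $encoding ( 'UTF-16BE', 'UTF-16LE', 'UTF-8' ) {
    printf "%-8s %s\n", $encoding, hex_bytes( $encoding, "\x{00c0}" );
}
```

Whichever line matches what the hex viewer shows for a known accented character is the encoding name to hand to decode().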
For the second point, Encode's default behavior will be to insert "?" for characters that can't be coerced into the desired character set -- watch out for question marks in your output.
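If you would rather be told about such characters than hunt for question marks, Encode accepts a third "check" argument. Here is a rough sketch using Encode::FB_CROAK, which makes encode() die on the first character it can't map. The helper name to_latin1_strict is my own invention, and the sample strings are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# to_latin1_strict() is a hypothetical helper: it returns the
# iso-8859-1 bytes, or undef if some character has no Latin1 equivalent.
sub to_latin1_strict {
    my ($unistr) = @_;
    my $bytes = eval {
        # LEAVE_SRC keeps encode() from modifying $unistr in place
        encode( 'iso-8859-1', $unistr,
            Encode::FB_CROAK() | Encode::LEAVE_SRC() );
    };
    return $bytes;    # undef when encode() died
}

my $ok  = to_latin1_strict("caf\x{00e9}");    # e-acute fits in Latin1
my $bad = to_latin1_strict("\x{20ac}5");      # the euro sign does not

print defined $ok  ? "converted\n" : "unmappable character\n";
print defined $bad ? "converted\n" : "unmappable character\n";
```

The default (no third argument) is still the right choice if a "?" here and there is acceptable; the strict form is more useful while you are still working out what the DB really contains.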
For the third case, if you really are just dealing with Latin1 characters, and your DB holds utf16 data, then each character is simply its Latin1 byte paired with a null byte, so the easiest thing is to just remove the null bytes (s/\x00//g;), and the result will be a "pure" latin1 string. If it's utf8 and all else fails, you could just do the necessary bit-shifting to arrive at the corresponding 8859-1 characters -- e.g. this would do it:
(update: added a bit more commentary to the "kluged" utf8-to-latin1 conversion)

    # snippet to convert utf8 to latin1 -- NB: only works for utf8
    # characters that correlate to unicode \x{0000} - \x{00ff}
    # (and you really should figure out how to convert using a module)
    my @bytes = unpack 'C*', $_;    # break utf8 string into bytes
    $_ = '';
    while ( @bytes ) {
        my $byte = shift @bytes;
        if ( $byte & 0x80 ) {    # start of utf8 (latin1) character
            my $c = ( $byte & 3 ) << 6;    # 1st utf8 byte carries top 2 latin1 bits
            $_ .= chr( $c | ( shift(@bytes) & 0x3f ) );    # 2nd byte has the other 6 bits
        }
        else {
            $_ .= chr( $byte );    # utf8 ascii is just ascii.
        }
    }
    # now $_ holds latin1 (single-byte, iso-8859-1) characters
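For the utf16 case mentioned above, the module-based route is not much longer than stripping nulls, and it doesn't silently mangle anything outside Latin1. A minimal sketch, assuming you have already worked out from a hex dump whether the data is UTF-16BE or UTF-16LE; the helper name utf16_to_latin1 and the sample data are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# utf16_to_latin1() is a hypothetical helper: $raw holds UTF-16 bytes,
# $variant is 'UTF-16BE' or 'UTF-16LE'.
sub utf16_to_latin1 {
    my ( $raw, $variant ) = @_;
    my $unistr = decode( $variant, $raw );       # bytes -> perl characters
    return encode( 'iso-8859-1', $unistr );      # characters -> latin1 bytes
}

# sample data: A-grave, space, "l", "a" as big-endian 16-bit units
my $raw    = pack 'n*', 0xc0, 0x20, 0x6c, 0x61;
my $latin1 = utf16_to_latin1( $raw, 'UTF-16BE' );
printf "%d bytes in, %d bytes out\n", length $raw, length $latin1;
```

For data that really is all Latin1, this should give the same bytes as removing the nulls by hand.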
Re: Re: Encoding of DBI PostgreSQL output
by Kjetil (Sexton) on May 21, 2003 at 11:25 UTC
by graff (Chancellor) on May 21, 2003 at 16:15 UTC