Kjetil has asked for the wisdom of the Perl Monks concerning the following question:

Oh, monks, I think I have tried everything, but the solution....

I'm retrieving data from a PostgreSQL database, using Perl DBI.

The data is to be printed on mail labels, using Postscript::MailLabels, which is an excellent module, but apparently, it needs Latin1-encoded data to work.

Natively, data in the DB is UNICODE encoded, I assume that means UTF8. So, the problem is to reencode the data into Latin1.

There are, apparently, many modules to do this: Unicode::Map, Unicode::Map8, Unicode::MapUTF8 and Unicode::String. But if I use these, most of them apparently just strip off the Norwegian characters (which I'm interested in), or in some cases, I lose the whole string.

Probably, I'm just using them wrongly... Digesting the fine manuals has been attempted; for example, I have set

Unicode::String->stringify_as("utf8");

which is supposed to tell the module that the strings it gets are UTF8. But it doesn't hear my prayers...

Finally, I've tried to let PostgreSQL do the work, by going

$rv = $dbh->do("SET CLIENT_ENCODING TO 'LATIN1';");

but, apparently, the output I get is 7 bits, as ø is mapped to x (IIRC).

Wise ones, surely you have been in the same position when you were young, how do I do this right?

Replies are listed 'Best First'.
Re: Encoding of DBI PostgreSQL output
by skillet-thief (Friar) on May 20, 2003 at 17:43 UTC

    I am not an expert, but I think I followed the same path as you. Here is the solution that works for me, which is only slightly different from the last thing you tried:

    $dbh->do(" SET CLIENT_ENCODING = 'Latin1' ");

    I'm not sure whether using '=' instead of 'TO' matters, but this did work for me.

    HTH

    Update: Fixed typo (forgot 'do' in code).

      Thanks, but unfortunately, it didn't work for me... :-(
Re: Encoding of DBI PostgreSQL output
by gmpassos (Priest) on May 20, 2003 at 18:16 UTC
    Note that if you have characters higher than 255, you can't represent them in extended (8-bit) ASCII (Latin1). Probably you already know that, but I don't know whether the Norwegian characters are above 255 or not.

    But you can use pack to check the index of the character:

    $utf8_char = pack("U", 300);  ## Change 300 here for 'XX' (the char).
    print unpack("U", $utf8_char);

    Graciliano M. P.
    "The creativity is the expression of the liberty".

      I must admit I didn't really understand that, I tried to replace U with å, but it didn't turn out anything sensible.

      However, I think that the Norwegian characters are within the usual 8 bits; in ISO-8859-1, ø is F8 (in hex), and that's the last of the interesting characters.

        No! You must change the value of $utf8_char.

        I put a pack() in the place of the value, since my OS is only ASCII.

        Graciliano M. P.
        "The creativity is the expression of the liberty".

Re: Encoding of DBI PostgreSQL output
by graff (Chancellor) on May 20, 2003 at 23:14 UTC
    I've been handling this sort of thing a lot using perl 5.8's Encode module. Getting the unicode interpretation of data to work properly with other modules (e.g. DBI) may take some experimentation, but in a case like yours, the first thing I would try would go something like this:
    use Encode;
    use DBI;   # (or whatever you use for your PostgreSQL)

    my $dbstr;   # suppose this holds "unicode" data from the DB
    # ... do whatever it takes to fetch a value into $dbstr;

    # since DBI might not be "unicode-aware", you may need to
    # coerce perl into treating the value as unicode:
    my $unistr    = decode( 'utf8', $dbstr );
    my $latin1str = encode( 'iso-8859-1', $unistr );
    print $latin1str;
    Now, among the things that could go wrong are:
    • the database actually has utf16 (BE or LE), not utf8
    • there are unicode values in the database that are outside the Latin1 range
    • you don't have perl 5.8, and can't install it for some reason.

    For the first, if you could dump the relevant "raw" database content to a file, use a hex-mode viewer on that file to see which variant of unicode you're dealing with (e.g. \x{00c0}, A-grave, would show up as one of the following byte sequences: "00 c0" (utf16BE); "c0 00" (utf16LE); "c3 80" (utf8)). With perl 5.8, just put the appropriate choice as the first arg to "decode()".

    For the second point, Encode's default behavior will be to insert "?" for characters that can't be coerced into the desired character set -- watch out for question marks in your output.
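    That fallback behaviour is easy to demonstrate. A small sketch (assuming perl 5.8+): U+0152 (the OE ligature) has no Latin-1 equivalent, so the default encode() silently substitutes a question mark, while the FB_CROAK check flag raises an exception instead:

    ```perl
    use strict;
    use warnings;
    use Encode qw(encode FB_CROAK);

    # Default CHECK: unmappable characters become "?" with no warning.
    my $latin1 = encode('iso-8859-1', "\x{0152}");
    print $latin1, "\n";                 # prints "?"

    # With FB_CROAK you get an exception instead of silent data loss.
    # (FB_CROAK may modify its source string, so pass a variable, not a literal.)
    my $str = "\x{0152}";
    my $ok  = eval { encode('iso-8859-1', $str, FB_CROAK); 1 };
    print $ok ? "encoded\n" : "croaked: $@";
    ```

    Running with FB_CROAK during debugging makes the lossy conversions visible instead of leaving you hunting for stray question marks.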

    For the third case, if you really are just dealing with Latin1 characters, and your DB holds utf16 data, then the easiest thing is to just remove the null bytes (s/\x00//g;), and the result will be a "pure" latin1 string. If it's utf8 and all else fails, you could just do the necessary bit-shifting to arrive at the corresponding 8859-1 characters -- e.g. this would do it:

    # snippet to convert utf8 to latin1 -- NB: only works for utf8
    # characters that correlate to unicode \x{0000} - \x{00ff}
    # (and you really should figure out how to convert using a module)
    my @bytes = unpack "C*", $_;   # break utf8 string into bytes
    $_ = '';
    while ( @bytes ) {
        my $b = shift @bytes;
        if ( $b & 0x80 ) {   # start of utf8 (latin1) character
            my $c = ( $b & 3 ) << 6;               # 1st utf8 byte carries top 2 latin1 bits
            $_ .= chr( $c | ( shift( @bytes ) & 0x3f ) );   # 2nd byte has the other 6 bits
        }
        else {
            $_ .= chr( $b );   # utf8 ascii is just ascii
        }
    }
    # now $_ holds latin1 (single-byte, iso-8859-1) characters
    (update: added a bit more commentary to the "kluged" utf8-to-latin1 conversion)
      This is getting closer!

      Indeed, I'm getting question marks now, so it seems that the characters are indeed somewhat out of range.

      But it also makes it harder to understand... Because the characters should be the standard Norwegian characters, all of which are in Latin1... If there were some characters that aren't, I could tolerate a ? now and then, as long as they weren't Norwegian...

      What could they be, then....?

      Messages come in via e-mail, then encoded as iso8859-1, quoted-printable.

      In my initial perl-script, they are decoded using MIME::QuotedPrint. The strings are then inserted into the DB by DBI.

      If I use the psql client, the letters come out right, but if I print them they come out as e.g. Ã|.

      The two characters Ã| imply UTF8, don't they?

      I tried to encode with UTF16, but it resulted in errors like:

      UTF-16BE:Partial character at /usr/lib/perl5/5.8.0/i386-linux-thread-multi/Encode.pm line 156.

      But I guess that's a sign it is not UTF16... This happened for LE too.

      So, I guess what this means is that it is UTF8, but for some reason, the normal Norwegian characters are now outside the range of Latin1. I've seen ø QP-encoded as =F8, and that corresponds to its hex in Latin1. But, apparently, something happens in the database at some point.

      I couldn't find a hex tool here now, but I'll look for it.

      Thanks a lot for the help, and more suggestions are always very welcome!
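      The "Ã¸"-style symptom described above is consistent with double encoding: if the quoted-printable "=F8" was decoded to the Latin-1 byte \xF8, and that byte was later treated as Latin-1 and re-encoded to UTF-8 somewhere along the way, "ø" turns into the two bytes C3 B8 -- which display as "Ã¸" on a Latin-1 terminal. A hypothetical reconstruction (assuming perl 5.8+'s Encode; this is a sketch, not the poster's actual pipeline):

      ```perl
      use strict;
      use warnings;
      use Encode qw(decode encode);

      # "ø" as a single ISO-8859-1 byte (what MIME::QuotedPrint produces from =F8)
      my $latin1_byte = "\xF8";

      # Treating it as Latin-1 and re-encoding to UTF-8 yields two bytes, C3 B8:
      my $as_utf8 = encode('utf8', decode('iso-8859-1', $latin1_byte));
      print join(" ", map { sprintf "%02x", $_ } unpack("C*", $as_utf8)), "\n";
      # prints "c3 b8"

      # Doing it again (double encoding) is what stores mangled forms like "Ã¸"
      # in the database itself -- now four bytes:
      my $double = encode('utf8', decode('iso-8859-1', $as_utf8));
      print join(" ", map { sprintf "%02x", $_ } unpack("C*", $double)), "\n";
      # prints "c3 83 c2 b8"
      ```

      If the dumped bytes show C3 83 C2 B8 where a single ø belongs, the data was encoded to UTF-8 twice, and one round of decode('utf8', ...) will recover the intermediate form.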

        The two characters Ã| imply UTF8, doesn't it?

        Well, that's not clear... It's closer to utf8 than it is to anything else I'm aware of, but the second character you have posted there is a plain-ascii "vertical bar", \x7c, which in combination with the initial A-tilde (\xC3) constitutes an invalid, unusable byte sequence for utf8. That sort of problem would certainly explain the presence of a "?" when you try to convert this to latin1.

        I couldn't find a hex tool here now, but I'll look for it.

        Sounds like you really need one. All unix/linux systems have "od" (and GNU and others have MSwindows-ported versions); naturally, Perl can be used to provide this facility as well:

        @bytes = unpack "C*", $_;   # break utf8 string into bytes
        for ($i = 0; $i < @bytes; $i += 8) {
            $j = ($i+7 < $#bytes) ? $i+7 : $#bytes;
            print join(" ", map {sprintf "%.2x", $bytes[$_]} $i .. $j), $/;
        }
        (That's a real kluge, but good enough to start with.)

        If, as seems possible, your DB entries contain corrupted utf8 character data, you'll need to diagnose the problems, patch them, and update the tables as needed -- you should be able to reconstruct the intended characters to replace mangled ones, based on context. Good luck with that.
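        One way to diagnose which rows hold corrupt sequences is strict decoding. A sketch (assuming perl 5.8+; the helper name is mine, not from the thread): decode with FB_CROAK croaks on the first malformed byte sequence instead of silently substituting, so a simple eval turns it into a validity test you can run over every fetched value:

        ```perl
        use strict;
        use warnings;
        use Encode qw(decode FB_CROAK);

        # Returns 1 if the octets are structurally valid UTF-8, 0 otherwise.
        # (FB_CROAK can modify its argument, so work on the lexical copy.)
        sub is_valid_utf8 {
            my ($bytes) = @_;
            return eval { decode('UTF-8', $bytes, FB_CROAK); 1 } ? 1 : 0;
        }

        print is_valid_utf8("\xc3\xb8"), "\n";   # 1 -- valid utf8 for "ø"
        print is_valid_utf8("\xc3\x7c"), "\n";   # 0 -- the broken "Ã|" sequence
        ```

        Looping this over the suspect column would give a list of exactly the rows that need hand-patching.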

Re: Encoding of DBI PostgreSQL output
by argggh (Monk) on May 21, 2003 at 07:44 UTC
    It's not really clear to me whether your problem is with Postgres or Perl. Supposing that the data returned from Postgres is proper UTF8-strings, you should be able to use Unicode::String thus:
    $ perl -MUnicode::String=utf8 -le 'print utf8("blÃ¥bærsyltetøy")->latin1()'
    blåbærsyltetøy
    $
      Hehe!

      How would I do this if I were to translate a whole file?

      I tried:

      use Unicode::String qw(utf8 utf16 latin1);
      Unicode::String->stringify_as("utf8");
      while (<>) {
          print utf8($_)->latin1();
      }

      But that did nothing...

      Anyway, that you have Ã¥ for å seems to imply that I do have the correct characters, since that is what I see for å too...

        Uhm, an RTFM response would be appropriate to that question... :-) OK, umap should do that...

        But umap has no effect on my files....

        Weird....?

Re: Encoding of DBI PostgreSQL output
by AndyH (Sexton) on May 21, 2003 at 10:34 UTC

    I have a similar problem and am driving myself crazy with it.

    I need to convert a string in Perl's internal format, guaranteed to contain no characters higher than \x{017F} (i.e. it's ASCII plus Latin Extended-A), into "hex entities", e.g. &#x017F;, to be sent to a browser (don't suggest changing browser or changing the encoding layer of the CGI talking to the browser -- only the hex entity solution will work in my particular case!).

    I got so fed up playing with encode/decode, pack/unpack, et al, that I just did it as a clumsy mapping table (deadline was pressing) e.g.

    $string =~ s/\x{017F}/&#x017F;/g;   # (one such line for every char from \x{00C0} to \x{017F})

    which has the merits of (a) working and (b) being simple enough for a simple soul such as myself to understand. However, there has to be a decent (and faster, more elegant, etc) solution to this. Sounds like one for the fans of "map" ...

    AndyH

      Looks like this is pretty much the solution for me too...

      I'm running my files through this:

      while (<>) {
          s/\xc3\xa5/å/g;
          s/\xc3\xb8/ø/g;
          s/\xc3\xa6/æ/g;
          s/\xc3\xa9/é/g;
          s/\xc3\x85/Å/g;
          s/\xc3\x86/Æ/g;
          s/\xc3\x98/Ø/g;
          s/\xc3\x96/Ö/g;
          print;
      }

      I have been able to get a hexdump of the files, but I really don't understand what I read, because it looks as if there are other characters inserted between the expected hex values, e.g.:

      6469 c320 7396 6c74

      c3 is the à and 96 is indeed not in Latin1.... I think the latter probably has something to do with the question marks I got.

      Why this is so, or whether it is significant, is beyond me, and unfortunately, the pressure on me is such that I won't have time to find out (hate that). But it works reasonably well.

        6469 c320 7396 6c74

        Whoa! Now that's a symptom of something gone awry in your interpretation of the data.

        I'll bet that the intended string, dumped out as octets rather than as 16-bit "words", would be:

        69  64  20  c3 96    73  74  6c
         i   d  sp  <U00D6>   s   t   l
                    (O-umlaut)
        Note that your 16-bit rendering makes it look like there are at least two errors in the string; shuffling the bytes back to their true order makes both problems go away.

        When you handle a utf8 string as binary data, ALWAYS treat it as bytes, NEVER as 16-bit words. (update: when handling a utf8 string in perl 5.6 or later as a perl-internal unicode character string, you will of course treat it as characters, and you'll stop thinking in terms of bytes.) This will save you from byte-order issues, which are obviously coming into play here. Byte-order independence is an intrinsic part of utf8's design.
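        The bytes-versus-words point can be demonstrated on the very string from the dump above. A sketch (assuming the intended text is "id Östl" as utf8 octets): unpacking as bytes ("C*") gives one answer regardless of platform, while 16-bit words differ between big-endian ("n*") and little-endian ("v*") readings -- and the little-endian reading reproduces the confusing dump exactly:

        ```perl
        use strict;
        use warnings;

        my $octets = "id \xc3\x96stl";   # "id Östl" encoded as utf8 bytes

        # As bytes: byte-order independent, and the utf8 pair c3 96 stays adjacent.
        print join(" ", map { sprintf "%02x", $_ } unpack "C*", $octets), "\n";
        # 69 64 20 c3 96 73 74 6c

        # As 16-bit words: the answer depends on which byte order you assume.
        print join(" ", map { sprintf "%04x", $_ } unpack "n*", $octets), "\n";
        # 6964 20c3 9673 746c   (big-endian words)
        print join(" ", map { sprintf "%04x", $_ } unpack "v*", $octets), "\n";
        # 6469 c320 7396 6c74   (little-endian words -- the dump from above)
        ```

        The little-endian word view makes the c3 96 pair look split across two "characters", which is exactly the illusion of corruption described earlier.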

        Found the answer (for my problem, anyway) here:
        http://www.perldoc.com/perl5.8.0/pod/perluniintro.html#Displaying-Unicode-As-Text
        and adapted it to solve my need thus:
        sub hexent {
            my $utf8 = shift;
            # convert utf8 characters greater than 255 into hex entities
            my $mapped = join("",
                map { $_ > 255 ? sprintf("&#x%04X;", $_) : chr($_) }
                unpack("U*", $utf8));
            return $mapped;
        }

        Four of the hardest-to-understand Perl functions in one line!

        Hope this helps you with your problem ...

        AndyH