in reply to Re: Encoding of DBI PostgreSQL output
in thread Encoding of DBI PostgreSQL output

Looks like this is pretty much the solution for me too...

I'm running my files through this:

while (<>) {
    s/\xc3\xa5/å/g;
    s/\xc3\xb8/ø/g;
    s/\xc3\xa6/æ/g;
    s/\xc3\xa9/é/g;
    s/\xc3\x85/Å/g;
    s/\xc3\x86/Æ/g;
    s/\xc3\x98/Ø/g;
    s/\xc3\x96/Ö/g;
    print;
}

I have been able to get a hexdump of the files, but I really don't understand what I'm reading, because it looks as if other characters have been inserted between those hex characters, e.g.:

6469 c320 7396 6c74

c3 is the Ã and 96 is indeed not a printable character in Latin1.... I think the latter probably has something to do with the question marks I got.

Why this is so, or whether it is significant, is beyond me, and unfortunately the pressure on me is such that I won't have time to find out (hate that). But it works reasonably well.
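
For comparison, the substitution loop above handles only the eight characters it lists. A more general version could be sketched with the core Encode module, assuming the input really is UTF-8 and the target really is Latin-1. The helper name utf8_to_latin1 and the sample string are made up for illustration. As far as I understand Encode's default behaviour, characters that have no Latin-1 equivalent come out as '?'.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn a string of UTF-8 octets into Latin-1 octets.
sub utf8_to_latin1 {
    my ($octets) = @_;
    my $chars = decode('UTF-8', $octets);    # octets -> Perl characters
    return encode('ISO-8859-1', $chars);     # characters -> Latin-1 octets
}

# Sample input is an assumption: "\xc3\xb8" is the UTF-8 form of o-slash.
my $in  = "sm\xc3\xb8rbr\xc3\xb8d\n";
my $out = utf8_to_latin1($in);

# Dump the result as octets so the single byte f8 is visible.
print join(' ', map { sprintf '%02x', $_ } unpack('C*', $out)), "\n";
```

In a filter like the one above, the body of the while loop would become something like print utf8_to_latin1($_);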

Re: Re: Re: Encoding of DBI PostgreSQL output
by graff (Chancellor) on May 21, 2003 at 16:40 UTC
    6469 c320 7396 6c74

    Whoa! Now that's a symptom of something gone awry in your interpretation of the data.

    I'll bet that the intended string, dumped out as octets rather than as 16-bit "words", would be:

    69 64 20   c3 96   73 74 6c
    i  d  <sp> <U00D6> s  t  l
               O-umlaut

    Note that your 16-bit rendering makes it look like there are at least two errors in the string; shuffling the bytes back to their true order makes both problems go away.

    When you handle a utf8 string as binary data, ALWAYS treat it as bytes, NEVER as 16-bit words. (update: when handling a utf8 string in perl 5.6 or later as a perl-internal unicode character string, you will of course treat it as characters, and you'll stop thinking in terms of bytes.) This will save you from byte-order issues, which are obviously coming into play here. Being byte-oriented is an intrinsic part of utf8's design.
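
    The byte shuffling described above can be reproduced in a few lines. This is only a sketch, and it assumes the dump tool printed each pair of bytes as one little-endian 16-bit word, which is what the numbers suggest:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The four 16-bit words exactly as the hexdump printed them.
my @words = qw(6469 c320 7396 6c74);

# pack 'v' writes a 16-bit value low byte first, which undoes the swap
# that the word-oriented dump introduced.
my $bytes = join '', map { pack 'v', hex $_ } @words;

# Show the octets in their true order.
my $octets = join ' ', map { sprintf '%02x', $_ } unpack('C*', $bytes);
print "$octets\n";

# Decode the octets as UTF-8 and report the character at offset 3.
my $text = decode('UTF-8', $bytes);
printf "%d characters, U+%04X at offset 3\n",
    length($text), ord(substr($text, 3, 1));
```

    The octet line comes out as 69 64 20 c3 96 73 74 6c, and the c3 96 pair decodes to the single character U+00D6.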

Re: Re: Re: Encoding of DBI PostgreSQL output
by AndyH (Sexton) on May 22, 2003 at 13:38 UTC
    Found the answer (for my problem, anyway) here:
    http://www.perldoc.com/perl5.8.0/pod/perluniintro.html#Displaying-Unicode-As-Text
    and adapted it to solve my need thus:
    sub hexent {
        my $utf8 = shift;
        # convert utf8 characters greater than 255 into hex entities
        my $mapped = join("",
            map { $_ > 255 ? sprintf("&#x%04X;", $_) : chr($_) }
            unpack("U*", $utf8));
        return $mapped;
    }

    Four of the hardest-to-understand Perl functions in one line!
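
    A quick way to see what it does, with the sub repeated in compact form so the snippet runs on its own. The sample string is an assumption chosen to hold one character below 256 and one above:

```perl
use strict;
use warnings;

# Same logic as hexent above.
sub hexent {
    my $str = shift;
    return join "", map { $_ > 255 ? sprintf("&#x%04X;", $_) : chr($_) }
                    unpack("U*", $str);
}

# O-umlaut (U+00D6) is below 256 and passes through unchanged;
# the euro sign (U+20AC) is above 255 and becomes a hex entity.
my $in  = "\x{D6}stl \x{20AC}5";
my $out = hexent($in);

# Print only the entity part, to keep the output plain ASCII.
print substr($out, 5), "\n";
```

    Everything up to 255 is left alone, so the result can still be written out as Latin-1.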

    Hope this helps you with your problem ...

    AndyH