Re: Re: Encoding of DBI PostgreSQL output

Looks like this is pretty much the solution for me too...

I'm running my files through this:

while (<>) {
    s/\xc3\xa5/å/g;
    s/\xc3\xb8/ø/g;
    s/\xc3\xa6/æ/g;
    s/\xc3\xa9/é/g;
    s/\xc3\x85/Å/g;
    s/\xc3\x86/Æ/g;
    s/\xc3\x98/Ø/g;
    s/\xc3\x96/Ö/g;
    print;
}
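A more general alternative to listing each character by hand, sketched under the assumption that the input really is valid UTF-8 and the desired output is Latin-1 (ISO-8859-1). The helper name `utf8_to_latin1` is made up for illustration; `decode` and `encode` come from the core Encode module (Perl 5.8 and later).

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert a UTF-8 byte string to Latin-1 bytes.
sub utf8_to_latin1 {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);    # UTF-8 bytes -> Perl characters
    return encode('ISO-8859-1', $chars);    # characters  -> Latin-1 bytes
}

# "id Östl" as UTF-8 bytes, converted and dumped one byte at a time.
my $latin1 = utf8_to_latin1("id \xc3\x96stl");
print join(' ', map { sprintf '%02x', ord($_) } split //, $latin1), "\n";
# prints: 69 64 20 d6 73 74 6c
```

To use it as a filter like the loop above, call `utf8_to_latin1` on each line read from `<>` and print the result. With the default settings, characters that have no Latin-1 equivalent come out as a substitution character instead of being dropped.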

I have been able to get a hexdump of the files, but I really don't understand what I'm reading, because it looks as though there are other characters inserted between those hex characters, e.g.:

 6469 c320 7396 6c74

c3 is the Ã and 96 is indeed not in Latin1.... I think the latter probably has something to do with the question marks I got.

Why this is so, or whether it is significant, is beyond me, and unfortunately the pressure on me is such that I won't have time to find out (hate that). But it works reasonably well.

Re: Re: Re: Encoding of DBI PostgreSQL output by graff (Chancellor) on May 21, 2003 at 16:40 UTC
6469 c320 7396 6c74

Whoa! Now that's a symptom of something gone awry in your interpretation of the data. I'll bet that the intended string, dumped out as octets rather than as 16-bit "words", would be:

69 64 20   c3 96   73 74 6c
i  d  <sp> <U00D6> s  t  l
           O-umlaut

Note that your 16-bit rendering makes it look like there are at least two errors in the string; shuffling the bytes back to their true order makes both problems go away.

When you handle ~~utf8 data~~ a utf8 string as binary data, ALWAYS treat it as bytes, NEVER as 16-bit words. (update: when handling a utf8 string in perl 5.6 or later as a perl-internal unicode character string, you will of course treat it as characters, and you'll stop thinking in terms of bytes.) This will save you from byte-order issues, which are obviously coming into play here. It's an intrinsic part of utf8's design.
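The byte shuffling described above can be checked with a short sketch. It assumes the dump came from a tool that prints little-endian 16-bit words (which is what the swapped pairs suggest); the helper name `words_to_bytes` is invented for this example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn little-endian 16-bit hex words back into bytes.
sub words_to_bytes {
    my @words = @_;
    # 'v' packs each number as an unsigned 16-bit little-endian value,
    # so 0x6469 becomes the byte pair 69 64.
    return pack('v*', map { hex($_) } @words);
}

my $bytes = words_to_bytes(qw(6469 c320 7396 6c74));
print join(' ', map { sprintf '%02x', ord($_) } split //, $bytes), "\n";
# prints: 69 64 20 c3 96 73 74 6c

# Decoding the restored bytes as UTF-8 gives O-umlaut as the fourth character.
my $chars = decode('UTF-8', $bytes);
printf "U+%04X\n", ord(substr($chars, 3, 1));
# prints: U+00D6
```

Dumping the file one byte at a time in the first place, for instance with `hexdump -C` or `od -An -tx1`, avoids the word-order confusion entirely.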
Re: Re: Re: Encoding of DBI PostgreSQL output by AndyH (Sexton) on May 22, 2003 at 13:38 UTC
Found the answer (for my problem, anyway) here: http://www.perldoc.com/perl5.8.0/pod/perluniintro.html#Displaying-Unicode-As-Text and adapted it to solve my need thus:

sub hexent {
    my $utf8 = shift;
    # convert utf8 characters greater than 255 into hex entities
    my $mapped = join("", map { $_ > 255 ? sprintf("&#x%04X;", $_) : chr($_) } unpack("U*", $utf8));
    return $mapped;
}

Four of the hardest-to-understand Perl functions in one command line! Hope this helps you with your problem ...

AndyH
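For illustration, here is a variant of the same idea that works on an already-decoded character string and uses `ord` instead of `unpack`. This is a sketch, not the code from perluniintro: the name `hexent_chars` is hypothetical, and it assumes the input bytes are valid UTF-8 that is decoded first with the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical variant: expects a decoded character string, not raw bytes.
sub hexent_chars {
    my ($chars) = @_;
    return join '', map {
        my $cp = ord($_);                          # code point of this character
        $cp > 255 ? sprintf('&#x%04X;', $cp) : $_; # entity only above Latin-1
    } split //, $chars;
}

# O-umlaut fits in Latin-1 and is kept; the euro sign (U+20AC) does not.
my $chars = decode('UTF-8', "\xc3\x96 costs 5 \xe2\x82\xac");
print encode('ISO-8859-1', hexent_chars($chars)), "\n";
# the euro sign is emitted as &#x20AC;
```

After the mapping, every remaining character is at most 255, so the result can be written out as Latin-1 without losing anything.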