Re: Encoding of DBI PostgreSQL output

I have a similar problem and am driving myself crazy with it.

I need to convert a string that is in Perl's internal format, and is guaranteed to contain no characters higher than \x{017F} (i.e. it's ASCII, plus extended Latin A), to "hex entities", e.g. &#x017F;, to be sent to a browser (don't suggest changing browser or changing the encoding layer of the CGI talking to the browser - only the hex entity solution will work in my particular case!).

I got so fed up playing with encode/decode, pack/unpack, et al, that I just did it as a clumsy mapping table (deadline was pressing) e.g.

$string =~ s/\x{017f}/&#x017F;/g (for every char from \x{00C0} to \x{017F})

which has the merits of (a) working and (b) being simple enough for a simple soul such as myself to understand. However, there has to be a decent (and faster, more elegant, etc) solution to this. Sounds like one for the fans of "map" ...

AndyH
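
One way the mapping table could be collapsed into a single substitution is sketched below. This is only a sketch under stated assumptions: the helper name `to_hex_entities` and its optional `$min` threshold are made up for illustration, and it assumes the input is already a decoded Perl character string, as described in the post.

```perl
use strict;
use warnings;

# Hypothetical helper (name and $min parameter are illustrative only):
# replace every character whose code point is >= $min with a zero-padded
# hex entity such as &#x017F;, leaving everything below $min untouched.
sub to_hex_entities {
    my ($text, $min) = @_;
    $min = 0x80 unless defined $min;    # assumption: plain ASCII stays as-is

    # /s lets "." match newlines too, /e evaluates the replacement as code.
    $text =~ s/(.)/ord($1) >= $min ? sprintf('&#x%04X;', ord($1)) : $1/gse;
    return $text;
}

my $sample = "caf\x{E9} \x{17F}";
print to_hex_entities($sample, 0xC0), "\n";   # caf&#x00E9; &#x017F;
```

Passing 0xC0 as the threshold mirrors the range covered by the mapping table above; passing 0x100 would leave the Latin-1 characters alone and only convert the wide ones.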

Re: Re: Encoding of DBI PostgreSQL output by Kjetil (Sexton) on May 21, 2003 at 15:51 UTC
Looks like this is pretty much the solution for me too... I'm running my files through this:

    while (<>) {
        s/\xc3\xa5/å/g;
        s/\xc3\xb8/ø/g;
        s/\xc3\xa6/æ/g;
        s/\xc3\xa9/é/g;
        s/\xc3\x85/Å/g;
        s/\xc3\x86/Æ/g;
        s/\xc3\x98/Ø/g;
        s/\xc3\x96/Ö/g;
        print;
    }

I have been able to get a hexdump of the files, but I really don't understand what I read, because it looks as if there are other characters inserted between those hex characters, e.g.:

    6469 c320 7396 6c74

`c3` is the `Ã` and `96` is indeed not in Latin1.... I think the latter probably has something to do with the question marks I got. Why this is so, or whether it is significant, is beyond me, and unfortunately the pressure on me is such that I won't have time to find out (hate that). But it works reasonably well.
Re: Re: Re: Encoding of DBI PostgreSQL output by graff (Chancellor) on May 21, 2003 at 16:40 UTC
    6469 c320 7396 6c74

Whoa! Now that's a symptom of something gone awry in your interpretation of the data. I'll bet that the intended string, dumped out as octets rather than as 16-bit "words", would be:

    69 64 20   c3 96    73 74 6c
    i  d  <sp> <U00D6>  s  t  l
               O-umlaut

Note that your 16-bit rendering makes it look like there are at least two errors in the string; shuffling the bytes back to their true order makes both problems go away.

When you handle ~~utf8 data~~ a utf8 string as binary data, ALWAYS treat it as bytes, NEVER as 16-bit words. (update: when handling a utf8 string in perl 5.6 or later as a perl-internal unicode character string, you will of course treat it as characters, and you'll stop thinking in terms of bytes.) This will save you from byte-order issues, which are obviously coming into play here. Freedom from byte-order issues is an intrinsic part of utf8's design.
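
The byte-shuffling can be reproduced with a short sketch. It assumes the file held the eight bytes reconstructed above, and that the hexdump tool grouped them into 16-bit little-endian words (a guess about the tool's default on that machine, not something stated in the thread).

```perl
use strict;
use warnings;

# The UTF-8 bytes reconstructed above: "id", space, O-umlaut (c3 96), "stl".
my $bytes = "id \xc3\x96stl";

# Dump as individual octets: the order matches the data on disk.
my $octets = join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;

# Dump as 16-bit little-endian words (assumed hexdump behaviour):
# each pair of bytes appears swapped.
my $words = join ' ', map { sprintf '%04x', $_ } unpack 'v*', $bytes;

print "$octets\n";   # 69 64 20 c3 96 73 74 6c
print "$words\n";    # 6469 c320 7396 6c74
```

The second line matches the puzzling dump exactly, which supports the reading that nothing was inserted between the characters; the pairs were only displayed in swapped order.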
Re: Re: Re: Encoding of DBI PostgreSQL output by AndyH (Sexton) on May 22, 2003 at 13:38 UTC
Found the answer (for my problem, anyway) here: http://www.perldoc.com/perl5.8.0/pod/perluniintro.html#Displaying-Unicode-As-Text and adapted it to solve my need thus:

    sub hexent {
        my $utf8 = shift;
        # convert utf8 characters greater than 255 into hex entities
        my $mapped = join("", map { $_ > 255 ? sprintf("&#x%04X;", $_) : chr($_) } unpack("U*", $utf8));
        return $mapped;
    }

Four of the hardest-to-understand Perl functions in one line! Hope this helps you with your problem ...

AndyH
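
For comparison, the core Encode module can produce a similar result: encoding to ISO-8859-1 with the `FB_XMLCREF` fallback keeps characters up to 255 as single bytes and turns anything unmappable into a hex character reference. This is a sketch of an alternative, not the method used above; the wrapper name `latin1_with_entities` is made up for illustration, and the exact digit case and padding of the references may differ from the `%04X` format in `hexent`.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical wrapper (name is illustrative only). Encode ships with
# Perl 5.8 and later.
sub latin1_with_entities {
    my ($text) = @_;    # work on a copy: encode() may modify its input when a CHECK value is given
    # Characters with no ISO-8859-1 equivalent become hex character references.
    return encode('iso-8859-1', $text, Encode::FB_XMLCREF());
}

my $out = latin1_with_entities("caf\x{E9} \x{17F}");
print "$out\n";    # a byte string: e-acute stays one Latin-1 byte, U+017F becomes an entity
```

Note that the result is a byte string in Latin-1, so this only fits if the page is served in that charset, which is the same assumption `hexent` makes by passing characters up to 255 through unchanged.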