in reply to Encoding of DBI PostgreSQL output

I have a similar problem and am driving myself crazy with it.

I need to convert a string that is in Perl's internal format, and which is guaranteed to contain no characters higher than \x{017F} (i.e. it's ASCII, plus extended Latin A), to "hex entities", e.g. &#x017F;, to be sent to a browser. (Don't suggest changing browser or changing the encoding layer of the CGI talking to the browser: only the hex entity solution will work in my particular case!)

I got so fed up playing with encode/decode, pack/unpack, et al, that I just did it as a clumsy mapping table (deadline was pressing) e.g.

$string =~ s/\x{017f}/&#x017F;/g;    (one such line for every char from \x{00C0} to \x{017F})

which has the merits of (a) working and (b) being simple enough for a simple soul such as myself to understand. However, there has to be a decent (and faster, more elegant, etc) solution to this. Sounds like one for the fans of "map" ...

AndyH

Re: Re: Encoding of DBI PostgreSQL output
by Kjetil (Sexton) on May 21, 2003 at 15:51 UTC
    Looks like this is pretty much the solution for me too...

    I'm running my files through this:

    while (<>) {
        s/\xc3\xa5/å/g;
        s/\xc3\xb8/ø/g;
        s/\xc3\xa6/æ/g;
        s/\xc3\xa9/é/g;
        s/\xc3\x85/Å/g;
        s/\xc3\x86/Æ/g;
        s/\xc3\x98/Ø/g;
        s/\xc3\x96/Ö/g;
        print;
    }

    I have been able to get a hexdump of the files, but I really don't understand what I read, because it looks as if there are other characters inserted between those hex characters, e.g.:

    6469 c320 7396 6c74

    c3 is the Ã, and 96 is indeed not in Latin1... I think the latter probably has something to do with the question marks I got.

    Why this is so, or whether it is significant, is beyond me, and unfortunately the pressure on me is such that I won't have time to find out (hate that). But it works reasonably well.

      6469 c320 7396 6c74

      Whoa! Now that's a symptom of something gone awry in your interpretation of the data.

      I'll bet that the intended string, dumped out as octets rather than as 16-bit "words", would be:

      69 64 20   c3 96    73 74 6c
      i  d  <sp> <U00D6>  s  t  l
                 O-umlaut

      Note that your 16-bit rendering makes it look like there are at least two errors in the string; shuffling the bytes back to their true order makes both problems go away.

      When you handle a utf8 string as binary data, ALWAYS treat it as bytes, NEVER as 16-bit words. (update: when handling a utf8 string in perl 5.6 or later as a perl-internal unicode character string, you will of course treat it as characters, and you'll stop thinking in terms of bytes.) This will save you from byte-order issues, which are obviously coming into play here. Being free of byte-order issues is an intrinsic part of utf8's design.
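To make the byte-order point concrete, here is a minimal sketch that dumps the same eight bytes both ways. It assumes the dump tool showed little-endian 16-bit words (which is what the swapped pairs above suggest); the variable names are just for illustration.

```perl
use strict;
use warnings;

# The eight octets from the dump above: "id ", then c3 96 (UTF-8 for
# U+00D6, O-umlaut), then "stl".
my $bytes = "id \xc3\x96stl";

# One octet at a time: the order matches the order in the file.
my $octets = join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;

# As little-endian 16-bit words: every pair of octets appears swapped,
# which splits the c3 96 sequence across two "words".
my $words = join ' ', map { sprintf '%04x', $_ } unpack 'v*', $bytes;

print "octets: $octets\n";
print "words:  $words\n";
```

The second line printed reproduces the confusing dump, and the first shows c3 96 sitting together as one character.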

      Found the answer (for my problem, anyway) here:
      http://www.perldoc.com/perl5.8.0/pod/perluniintro.html#Displaying-Unicode-As-Text
      and adapted it to solve my need thus:
      sub hexent {
          my $utf8 = shift;
          # convert utf8 characters greater than 255 into hex entities
          my $mapped = join("",
              map { $_ > 255 ? sprintf("&#x%04X;", $_) : chr($_) }
              unpack("U*", $utf8));
          return $mapped;
      }

      Four of the hardest-to-understand Perl functions in one command line!
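For anyone who would rather avoid unpack, the same kind of mapping can probably be done with a single substitution using the /e modifier. This is only a sketch: the sub name hexent_range is made up, it assumes the input is already a decoded character string, and unlike hexent above it converts the whole \x{00C0} to \x{017F} range from the original question, not just characters above 255.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every character in the range
# U+00C0..U+017F with a four-digit hex entity; leave the rest alone.
sub hexent_range {
    my ($text) = @_;
    $text =~ s/([\x{00C0}-\x{017F}])/sprintf("&#x%04X;", ord($1))/ge;
    return $text;
}

# Example input: contains U+017F (long s) and U+00E9 (e-acute).
my $out = hexent_range("Stra\x{017F}e caf\x{00E9}");
print "$out\n";    # Stra&#x017F;e caf&#x00E9;
```

This replaces the whole mapping table with one line, at the cost of running sprintf once per matched character.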

      Hope this helps you with your problem ...

      AndyH