Re: Need Help With Seemingly Bizarre Unicode Task

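Use oct instead of eval: given a string with a leading 0x, oct parses it as hexadecimal, so nothing has to be evaluated as code.
Use Encode's encode to get the UTF-8 encoding of the character.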

use strict;
use warnings;
use Encode qw( encode );

my $unicode_codepoint_hex_lit = '0x20ac';
my $unicode_codepoint_integer = oct( $unicode_codepoint_hex_lit );   # oct parses the 0x prefix as hex
my $unicode_char              = chr( $unicode_codepoint_integer );
my $utf8_bytes                = encode( 'UTF-8', $unicode_char );    # byte string: E2 82 AC
my @utf8_bytes                = map ord, $utf8_bytes =~ /./sg;       # one integer per byte
my @nohi_bytes                = map $_ & 0x7F, @utf8_bytes;          # clear the high bit of each byte

print(join(' ', map sprintf('%02X', $_),         @utf8_bytes), "\n");
print(join(' ', map sprintf('0x%02X', $_),       @utf8_bytes), "\n");
print(join(' ', map unpack('B8', pack('C', $_)), @utf8_bytes), "\n");
print("\n");
print(join(' ', map sprintf('%02X', $_),         @nohi_bytes), "\n");
print(join(' ', map sprintf('0x%02X', $_),       @nohi_bytes), "\n");
print(join(' ', map unpack('B8', pack('C', $_)), @nohi_bytes), "\n");

E2 82 AC
0xE2 0x82 0xAC
11100010 10000010 10101100

62 02 2C
0x62 0x02 0x2C
01100010 00000010 00101100


Replies are listed 'Best First'.
Re^2: Need Help With Seemingly Bizarre Unicode Task
by Jim (Curate) on Dec 31, 2007 at 23:33 UTC

Thank you very much, graff and ikegami. I picked and chose from both your responses to create the script below. It reads the file CP1252.TXT on the Unicode.org Web site and, from it, generates the peculiar chart of broken Unicode characters I need for my "seemingly bizarre" purpose.

It's actually not so bizarre. I'm helping diagnose a problem with a large system written in Visual Basic 6. It corrupts text -- lots of text. In addition to diagnosing the problem, I intend to use a Perl script to remediate as much of the damage done by the system as possible. The chart generated by the script below allows me to determine easily what damage I can and cannot repair.

I needed your help. I was struggling with the bitwise operation to mimic the data corruption I'm modelling (& 7F) and I also needed guidance using oct, chr, ord, sprintf and Encode.
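
To make the & 7F operation concrete, here is a minimal sketch of the corruption as I understand it: encode the text as UTF-8, then clear the high bit of every byte. The helper name strip_high_bits is made up for this illustration, and the assumption that the VB6 system damages text in exactly this way is mine.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: mimic the corruption by clearing the high bit
# of every byte in the UTF-8 encoding of a character string.
sub strip_high_bits {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);    # character string -> byte string
    return join '', map { chr(ord($_) & 0x7F) } split //, $bytes;
}

# EURO SIGN (U+20AC) followed by the ASCII digit 5.
my $damaged = strip_high_bits("\x{20AC}5");

# Show the damaged bytes in hex.
print join(' ', map { sprintf '%02X', ord($_) } split //, $damaged), "\n";
# prints: 62 02 2C 35
```

The three bytes of the euro sign (E2 82 AC) come out as 62 02 2C, while the ASCII digit is untouched because its high bit was never set.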

A few notes:

You must use Encode rather than depend on the fact that Perl uses UTF-8 for its internal representation of strings. Both perlunitut and perlunifaq are adamant about this point. Now I understand why. As the documentation of chr explains: "[C]haracters from 128 to 255 (inclusive) are by default internally not encoded as UTF-8 for backward compatibility reasons."
At first, I used perl -CO to suppress the "Wide character in print..." warning message. Then I put binmode STDOUT, ':utf8'; into the script itself, which is much better.
I used sprintf('%08b', $_) in lieu of unpack('B8', pack('C', $_)). Happily, I didn't have to use pack or unpack at all.
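
A small sketch of the first and third notes, using U+00E9 as an arbitrary example from the 128 to 255 range (my choice, not something taken from the chart). It counts the bytes that encode produces rather than relying on the internal representation, and prints each byte with both the sprintf and the pack/unpack approach so they can be compared.

```perl
use strict;
use warnings;
use Encode qw( encode );

my $char  = chr(0xE9);                 # one character in the 128..255 range
my $bytes = encode('UTF-8', $char);    # explicit UTF-8 encoding: C3 A9

print length($char), " character, ", length($bytes), " bytes\n";
# prints: 1 character, 2 bytes

for my $byte (map { ord($_) } split //, $bytes) {
    my $via_sprintf = sprintf('%08b', $byte);
    my $via_unpack  = unpack('B8', pack('C', $byte));
    print "$via_sprintf $via_unpack\n";
}
# prints: 11000011 11000011
#         10101001 10101001
```

For byte values (0 to 255) the two approaches give the same eight-digit string, so dropping pack and unpack loses nothing here.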

Please feel free to critique my script. All suggestions for improvement are gladly welcome. Thanks!