in reply to Re: Need Help With Seemingly Bizarre Unicode Task
in thread Need Help With Seemingly Bizarre Unicode Task

Thank you very much, graff and ikegami. I picked and chose from both your responses to create the script below. It reads the file CP1252.TXT on the Unicode.org Web site and, from it, generates the peculiar chart of broken Unicode characters I need for my "seemingly bizarre" purpose.

It's actually not so bizarre. I'm helping diagnose a problem with a large system written in Visual Basic 6. It corrupts text -- lots of text. In addition to diagnosing the problem, I intend to use a Perl script to remediate as much of the damage done by the system as possible. The chart generated by the script below allows me to determine easily what damage I can and cannot repair.

I needed your help. I was struggling with the bitwise operation that mimics the data corruption I'm modelling (& 0x7F applied to each UTF-8 octet), and I also needed guidance on using oct, chr, ord, sprintf and Encode.
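For a single character, the masking step can be sketched as follows. This is only a minimal illustration: the helper name corrupt_octets is made up for this example and does not appear in the full script, which inlines the same map.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper (not in the full script): clear the high bit of
# every UTF-8 octet of one character, mimicking the corruption.
sub corrupt_octets {
    my ($character) = @_;
    my @utf8_octets = map { ord } split m//, encode('UTF-8', $character);
    return map { $_ & 0x7F } @utf8_octets;
}

my $unicode_integer = oct '0x20AC';            # oct also parses "0x..." hex strings
my $euro            = chr $unicode_integer;    # EURO SIGN, U+20AC
my @corrupted       = corrupt_octets($euro);   # E2 82 AC becomes 62 02 2C

print join(' ', map { sprintf '%02X', $_ } @corrupted), "\n";    # prints "62 02 2C"
```

The result is always plain 7-bit ASCII, which is why the corrupted text still displays but no longer decodes to the original characters.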

A few notes:

Please feel free to critique my script. All suggestions for improvement are gladly welcome. Thanks!

#!C:/Perl/bin/perl.exe

use strict;
use warnings;

use Encode qw( encode );
use English qw( -no_match_vars );
use Fatal qw( open close );
use LWP::Simple qw( mirror );

local $OUTPUT_FIELD_SEPARATOR  = "\t";
local $OUTPUT_RECORD_SEPARATOR = "\n";

my $url = my $file
    = 'http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT';

$file =~ s{.*/}{};

-f $file or mirror($url, $file);
-f $file or die "You must first download the file $file:\n\n$url\n";

binmode STDOUT, ':utf8';

open my $fh, '<', $file;

while (<$fh>) {

    # Parse only lines from 0x80 thru 0xFF...
    next if not m{
        ^(0x[89A-F][[:xdigit:]])  # CP1252 hexadecimal string
        \s+
        (0x[[:xdigit:]]{4})?      # Unicode hexadecimal string
        \s+
        \#
        (.+)                      # Unicode name
    }x;

    my $cp1252_integer  = oct $1;
    my $unicode_integer = defined $2 ? oct $2 : 0xFFFD;
    my $unicode_name    = $3;

    my $cp1252_hexstring  = sprintf '0x%02X', $cp1252_integer;
    my $unicode_hexstring = sprintf 'U+%04X', $unicode_integer;

    my $unicode_character = chr $unicode_integer;

    my @utf8_octets
        = map { ord } split m//, encode('UTF-8', $unicode_character);

    my $utf8_hexstring = join ' ', map { sprintf '%02X', $_ } @utf8_octets;
    my $utf8_binstring = join ' ', map { sprintf '%08b', $_ } @utf8_octets;

    my @corrupted_octets         # = 62 02 2C   = 01100010 00000010 00101100
        = map { $_ & 0x7F }      # & 7F 7F 7F   & 01111111 01111111 01111111
            @utf8_octets;        #   E2 82 AC     11100010 10000010 10101100

    my $corrupted_hexstring
        = join ' ', map { sprintf '%02X', $_ } @corrupted_octets;
    my $corrupted_binstring
        = join ' ', map { sprintf '%08b', $_ } @corrupted_octets;

    my $corrupted_string
        = join '',
            map {
                ($_ > 0x20 && $_ < 0x5C) || ($_ > 0x5C && $_ < 0x7F)
                    ? chr $_
                    : sprintf '\\x%02X', $_
            } @corrupted_octets;

    print
        $unicode_name,
        $unicode_character,
        $cp1252_hexstring,
        $unicode_hexstring,
        $utf8_hexstring,
        $utf8_binstring,
        $corrupted_hexstring,
        $corrupted_binstring,
        $corrupted_string;
}

close $fh;

exit 0;
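
One way the chart might feed the remediation step is to invert it, grouping original characters by their corrupted form so that ambiguous forms stand out. The sketch below is an assumption about how that could be done, not part of the script above; the helper names corrupted_bytes and build_reverse_table, and the three sample characters, are chosen only for illustration.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: the corrupted byte string for one character,
# using the same "& 0x7F" model as the chart script.
sub corrupted_bytes {
    my ($character) = @_;
    return join '',
        map { chr( ord($_) & 0x7F ) } split m//, encode('UTF-8', $character);
}

# Hypothetical helper: map each corrupted form to every original
# character that produces it.
sub build_reverse_table {
    my @characters = @_;
    my %originals_for;
    for my $character (@characters) {
        push @{ $originals_for{ corrupted_bytes($character) } }, $character;
    }
    return \%originals_for;
}

# Sample input only: EURO SIGN, e-acute, capital E-acute.
my $table = build_reverse_table( map { chr } 0x20AC, 0x00E9, 0x00C9 );

for my $corrupted ( sort keys %{$table} ) {
    my $candidates  = scalar @{ $table->{$corrupted} };
    my $has_control = $corrupted =~ m/[\x00-\x1F]/ ? 'yes' : 'no';
    printf "%s\t%d candidate(s)\tcontains control byte: %s\n",
        join( ' ', map { sprintf '%02X', ord $_ } split m//, $corrupted ),
        $candidates, $has_control;
}
```

A corrupted form with exactly one candidate is only potentially repairable. Forms made entirely of printable ASCII can also occur in undamaged text, so a blind substitution could do further harm, whereas forms containing a control byte are less likely to be legitimate. Every entry the script maps to U+FFFD shares a single corrupted form, so those cannot be told apart.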