It's actually not so bizarre. I'm helping diagnose a problem with a large system written in Visual Basic 6. It corrupts text -- lots of text. In addition to diagnosing the problem, I intend to use a Perl script to remediate as much of the damage done by the system as possible. The chart generated by the script below allows me to determine easily what damage I can and cannot repair.
I needed your help. I was struggling with the bitwise operation that mimics the data corruption I'm modelling (& 0x7F, which clears the high bit of every UTF-8 octet), and I also needed guidance on using oct, chr, ord, sprintf and Encode.
A few notes:
Please feel free to critique my script. All suggestions for improvement are gladly welcome. Thanks!
#!C:/Perl/bin/perl.exe

use strict;
use warnings;

use Encode qw( encode );
use English qw( -no_match_vars );
use Fatal qw( open close );
use LWP::Simple qw( mirror );

local $OUTPUT_FIELD_SEPARATOR  = "\t";
local $OUTPUT_RECORD_SEPARATOR = "\n";

my $url = my $file
    = 'http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT';

$file =~ s{.*/}{};

-f $file or mirror($url, $file);
-f $file or die "You must first download the file $file:\n\n$url\n";

binmode STDOUT, ':utf8';

open my $fh, '<', $file;

while (<$fh>) {

    # Parse only lines from 0x80 thru 0xFF...
    next if not m{
        ^(0x[89A-F][[:xdigit:]])  # CP1252 hexadecimal string
        \s+
        (0x[[:xdigit:]]{4})?      # Unicode hexadecimal string
        \s+
        \#
        (.+)                      # Unicode name
    }x;

    my $cp1252_integer  = oct $1;
    my $unicode_integer = defined $2 ? oct $2 : 0xFFFD;
    my $unicode_name    = $3;

    my $cp1252_hexstring  = sprintf '0x%02X', $cp1252_integer;
    my $unicode_hexstring = sprintf 'U+%04X', $unicode_integer;

    my $unicode_character = chr $unicode_integer;

    my @utf8_octets
        = map { ord } split m//, encode('UTF-8', $unicode_character);

    my $utf8_hexstring = join ' ', map { sprintf '%02X', $_ } @utf8_octets;
    my $utf8_binstring = join ' ', map { sprintf '%08b', $_ } @utf8_octets;

    my @corrupted_octets       # =  62 02 2C  =  01100010 00000010 00101100
        = map { $_ & 0x7F }    # &  7F 7F 7F  &  01111111 01111111 01111111
          @utf8_octets;        #    E2 82 AC     11100010 10000010 10101100

    my $corrupted_hexstring
        = join ' ', map { sprintf '%02X', $_ } @corrupted_octets;
    my $corrupted_binstring
        = join ' ', map { sprintf '%08b', $_ } @corrupted_octets;

    my $corrupted_string
        = join '',
          map {
              ($_ > 0x20 && $_ < 0x5C) || ($_ > 0x5C && $_ < 0x7F)
                  ? chr $_
                  : sprintf '\\x%02X', $_
          } @corrupted_octets;

    print $unicode_name,
          $unicode_character,
          $cp1252_hexstring,
          $unicode_hexstring,
          $utf8_hexstring,
          $utf8_binstring,
          $corrupted_hexstring,
          $corrupted_binstring,
          $corrupted_string;
}

close $fh;

exit 0;
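For what it's worth, here is a minimal sketch of how the chart might feed into the repair step. It is not part of the script above, and it rests on an assumption of mine: a corrupted sequence that contains a C0 control octet (other than tab, LF or CR) is unlikely to be legitimate text, so it can be located and repaired by setting the high bit again. The helper names corrupt, restore and has_control_octet are hypothetical.

```perl
use strict;
use warnings;

use Encode qw( encode decode );

# Model the corruption: clear the high bit of every UTF-8 octet.
sub corrupt {
    my ($character) = @_;
    return join '',
        map { chr( ord($_) & 0x7F ) }
        split m//, encode('UTF-8', $character);
}

# Hypothetical repair: set the high bit on every octet again,
# then decode the result as UTF-8.
sub restore {
    my ($corrupted) = @_;
    my $octets = join '', map { chr( ord($_) | 0x80 ) } split m//, $corrupted;
    return decode('UTF-8', $octets);
}

# Assumed heuristic: a control octet other than tab, LF or CR
# rarely occurs in ordinary text, so its presence marks the
# sequence as probable damage rather than genuine ASCII.
sub has_control_octet {
    my ($corrupted) = @_;
    return $corrupted =~ m{[\x00-\x08\x0B\x0C\x0E-\x1F]} ? 1 : 0;
}

# EURO SIGN and LATIN SMALL LETTER E WITH ACUTE
for my $code_point (0x20AC, 0x00E9) {
    my $corrupted = corrupt(chr $code_point);
    my $hexstring
        = join ' ', map { sprintf '%02X', ord $_ } split m//, $corrupted;
    printf "U+%04X\t%s\t%s\tU+%04X\n",
        $code_point,
        $hexstring,
        has_control_octet($corrupted) ? 'detectable' : 'ambiguous',
        ord( restore($corrupted) );
}
```

Under that assumption, the euro sign (62 02 2C) is flagged as detectable because of the 02 octet, while U+00E9 becomes 43 29, which is the plain ASCII string "C)" and so cannot be told apart from genuine text without more context. In both cases restore recovers the original character; the hard part is deciding where it is safe to apply it.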
In reply to Re^2: Need Help With Seemingly Bizarre Unicode Task by Jim, in thread Need Help With Seemingly Bizarre Unicode Task by Jim