
Thank you very much, graff and ikegami. I picked and chose from both your responses to create the script below. It reads the file CP1252.TXT on the Unicode.org Web site and, from it, generates the peculiar chart of broken Unicode characters I need for my "seemingly bizarre" purpose.

It's actually not so bizarre. I'm helping diagnose a problem with a large system written in Visual Basic 6. It corrupts text -- lots of text. In addition to diagnosing the problem, I intend to use a Perl script to remediate as much of the damage done by the system as possible. The chart generated by the script below allows me to determine easily what damage I can and cannot repair.

I needed your help. I was struggling with the bitwise operation that mimics the data corruption I'm modelling (& 0x7F), and I also needed guidance on using oct, chr, ord, sprintf and Encode.
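For anyone following along, here is a minimal sketch of that masking step on its own, using the euro sign as the sample character. The helper name strip_high_bit is hypothetical. It only illustrates the & 0x7F operation as I'm modelling it and is not code from the Visual Basic 6 system.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: clear the high bit of every UTF-8 octet of a
# character, which is the corruption being modelled.
sub strip_high_bit {
    my ($character) = @_;
    my @octets = map { ord($_) } split m//, encode('UTF-8', $character);
    return map { $_ & 0x7F } @octets;
}

my @corrupted     = strip_high_bit(chr 0x20AC);   # EURO SIGN, UTF-8 E2 82 AC
my $corrupted_hex = join ' ', map { sprintf '%02X', $_ } @corrupted;

print "$corrupted_hex\n";    # prints "62 02 2C"
```

The three octets E2 82 AC each lose their top bit, so what survives is plain ASCII: a lowercase "b", a control character and a comma.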

A few notes:

You must use Encode rather than depend on the fact that Perl uses UTF-8 for its internal representation of strings. Both perlunitut and perlunifaq are adamant about this point. Now I understand why. As the documentation of chr explains: "[C]haracters from 128 to 255 (inclusive) are by default internally not encoded as UTF-8 for backward compatibility reasons."
At first, I used perl -CO to suppress the "Wide character in print..." warning message. Then I put binmode STDOUT, ':utf8'; into the script itself, which is much better.
I used sprintf('%08b', $_) in lieu of unpack('B8', pack('C', $_)). Happily, I didn't have to use pack or unpack at all.
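A small sketch of the first and third notes, assuming U+00E9 as a sample character from the 128 to 255 range. It shows that an explicit encode yields the UTF-8 octets regardless of how Perl stores the string internally, and it checks that sprintf('%08b') agrees with the pack/unpack idiom for every octet value.

```perl
use strict;
use warnings;
use Encode qw( encode );

# U+00E9 falls in the 128..255 range that, per the chr documentation,
# is not stored internally as UTF-8 by default.
my $character = chr 0xE9;

# Encoding explicitly yields the two UTF-8 octets C3 A9.
my @utf8_octets    = map { ord($_) } split m//, encode('UTF-8', $character);
my $utf8_hexstring = join ' ', map { sprintf '%02X', $_ } @utf8_octets;

print "$utf8_hexstring\n";    # prints "C3 A9"

# Compare the two ways of producing an 8-bit binary string.
my @mismatches
    = grep { sprintf('%08b', $_) ne unpack('B8', pack('C', $_)) }
      0 .. 255;

print scalar(@mismatches), "\n";    # prints "0"
```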

Please feel free to critique my script. All suggestions for improvement are welcome. Thanks!

#!C:/Perl/bin/perl.exe

use strict;
use warnings;
use Encode qw( encode );
use English qw( -no_match_vars );
use Fatal qw( open close );
use LWP::Simple qw( mirror );

local $OUTPUT_FIELD_SEPARATOR  = "\t";
local $OUTPUT_RECORD_SEPARATOR = "\n";

my $url = my $file
    = 'http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT';

$file =~ s{.*/}{};

-f $file or mirror($url, $file);

-f $file or die "You must first download the file $file:\n\n$url\n";

binmode STDOUT, ':utf8';

open my $fh, '<', $file;

while (<$fh>) {
    # Parse only lines from 0x80 thru 0xFF...
    next if not m{
       ^(0x[89A-F][[:xdigit:]]) # CP1252 hexadecimal string
        \s+
        (0x[[:xdigit:]]{4})?    # Unicode hexadecimal string
        \s+ \#
        (.+)                    # Unicode name
    }x;

    my $cp1252_integer    = oct $1;
    my $unicode_integer   = defined $2 ? oct $2 : 0xFFFD;
    my $unicode_name      = $3;

    my $cp1252_hexstring  = sprintf '0x%02X', $cp1252_integer;
    my $unicode_hexstring = sprintf 'U+%04X', $unicode_integer;
    my $unicode_character = chr $unicode_integer;

    my @utf8_octets
        = map { ord }
          split m//, encode('UTF-8', $unicode_character);

    my $utf8_hexstring
        = join ' ',
          map { sprintf '%02X', $_ }
          @utf8_octets;

    my $utf8_binstring
        = join ' ',
          map { sprintf '%08b', $_ }
          @utf8_octets;

    my @corrupted_octets    # = 62 02 2C  = 01100010 00000010 00101100
        = map { $_ & 0x7F } # & 7F 7F 7F  & 01111111 01111111 01111111
          @utf8_octets;     #   E2 82 AC    11100010 10000010 10101100

    my $corrupted_hexstring
        = join ' ',
          map { sprintf '%02X', $_ }
          @corrupted_octets;

    my $corrupted_binstring
        = join ' ',
          map { sprintf '%08b', $_ }
          @corrupted_octets;

    my $corrupted_string
        = join '',
          map {
              ($_ > 0x20 && $_ < 0x5C) || ($_ > 0x5C && $_ < 0x7F)
              ? chr $_
              : sprintf '\\x%02X', $_ 
          }
          @corrupted_octets;

    print $unicode_name,
          $unicode_character,
          $cp1252_hexstring,
          $unicode_hexstring,
          $utf8_hexstring,
          $utf8_binstring,
          $corrupted_hexstring,
          $corrupted_binstring,
          $corrupted_string;
}

close $fh;

exit 0;

In reply to Re^2: Need Help With Seemingly Bizarre Unicode Task by Jim
in thread Need Help With Seemingly Bizarre Unicode Task by Jim
