Jim has asked for the wisdom of the Perl Monks concerning the following question:

I have a hexadecimal string representation of a Unicode codepoint:

my $unicode_character_hexadecimal_string = '0x20ac';
First, I want to convert this hexadecimal string representation of a Unicode codepoint to an integer. Easy enough:

my $unicode_codepoint_integer = eval $unicode_character_hexadecimal_string;
Right?

From here, I want to get to some hexadecimal string representation of the UTF-8 encoding of this Unicode codepoint:

'0xE2 0x82 0xAC'
or perhaps simply

'E2 82 AC'
Then, as strange as it seems, I want to get to an analogous hexadecimal string representation of the same sequence of bytes with the most significant bit turned off (i.e., with 0x80 subtracted):

'0x62 0x02 0x2C'
or

'62 02 2C'
Oh, and along the way, I also want to get binary string representations of the same values:

'11100010 10000010 10101100'
and
'01100010 00000010 00101100'
Finally, I want to print the Unicode (UTF-8) characters alongside these various string representations.

I'm trying to generate a kind of metavalue table. You have to trust me: I really do want to do exactly what I've outlined above. I'm using Perl 5.8.8 (ActivePerl build 822).

Thanks.

Jim

Re: Need Help With Seemingly Bizarre Unicode Task
by graff (Chancellor) on Dec 30, 2007 at 06:21 UTC
    I have a hexadecimal string representation of a Unicode codepoint:
    my $unicode_character_hexadecimal_string = '0x20ac';

    It would be easier/better to leave off the quotes -- this way, you don't need to use eval later on:

    my $unicode_hex_codepoint = 0x20ac;
    From here, I want to get to some hexadecimal string representation of the UTF-8 encoding of this Unicode codepoint:
    '0xE2 0x82 0xAC'

    I don't understand why you would want to do that. Normally, you want to go directly to a perl-internal utf8 character:

    my $unicode_char = chr( $unicode_hex_codepoint );
    Then, as strange as it seems, I want to get to an analogous hexadecimal string representation of the same sequence of bytes with the most significant bit turned off (i.e., with 0x80 subtracted):
    '0x62 0x02 0x2C'

    You're right. That does seem very strange. I'll be losing sleep trying to imagine what sort of purpose this could possibly serve. In any case, if your ultimate goal is a print-out that looks something like this (I'm just guessing about the format):

    
      €  ==  20ac  == e2 82 ac == 11100010 10000010 10101100 ^^ 62 02 2c == 01100010 00000010 00101100
    
    
    Then something like this, perhaps:
    use strict;

    my $uni_hex = 0x20ac;
    my $uni_chr = chr($uni_hex);
    my ( $u8_byts, $strpd_byts );
    $u8_byts    .= sprintf( "%02x ", $_ )        for ( unpack( "C*", $uni_chr ));
    $strpd_byts .= sprintf( "%02x ", $_ & 0x7f ) for ( unpack( "C*", $uni_chr ));
    ( my $u8_bits = unpack( "B*", $uni_chr )) =~ s/(.{8})/$1 /g;
    ( my $strpd_bits = $u8_bits ) =~ s/\b1/0/g;
    printf( "%s == %04x == %s == %s ^^ %s == %s\n",
            $uni_chr, $uni_hex, $u8_byts, $u8_bits, $strpd_byts, $strpd_bits );
    There are other ways to do it, which might be more suitable, depending on why you really want to do this (and what sorts of data you'll be dealing with).
Re: Need Help With Seemingly Bizarre Unicode Task
by ikegami (Patriarch) on Dec 30, 2007 at 07:23 UTC

    Use oct instead of eval.
    Use Encode's encode to find the UTF-8 encoding.

    use Encode qw( encode );

    my $unicode_codepoint_hex_lit = '0x20ac';
    my $unicode_codepoint_integer = oct( $unicode_codepoint_hex_lit );
    my $unicode_char = chr( $unicode_codepoint_integer );
    my $utf8_bytes = encode( 'UTF-8', $unicode_char );
    my @utf8_bytes = map ord, $utf8_bytes =~ /./sg;
    my @nohi_bytes = map $_ & 0x7F, @utf8_bytes;

    print(join(' ', map sprintf('%02X', $_), @utf8_bytes), "\n");
    print(join(' ', map sprintf('0x%02X', $_), @utf8_bytes), "\n");
    print(join(' ', map unpack('B8', pack('C', $_)), @utf8_bytes), "\n");
    print("\n");
    print(join(' ', map sprintf('%02X', $_), @nohi_bytes), "\n");
    print(join(' ', map sprintf('0x%02X', $_), @nohi_bytes), "\n");
    print(join(' ', map unpack('B8', pack('C', $_)), @nohi_bytes), "\n");

    E2 82 AC
    0xE2 0x82 0xAC
    11100010 10000010 10101100

    62 02 2C
    0x62 0x02 0x2C
    01100010 00000010 00101100
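On the first piece of advice, here is a minimal sketch of why oct (or hex) can replace eval: both parse the string as a number and never execute it as Perl code. The helper name parse_codepoint and its validation regex are illustrative assumptions, not something posted in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a string such as '0x20ac' or '20AC' into an
# integer without eval. hex() accepts the string with or without a
# leading '0x'; oct() treats a leading '0x' as hexadecimal.
sub parse_codepoint {
    my ($str) = @_;
    die "Not a hexadecimal string: $str\n"
        unless $str =~ /\A(?:0[xX])?[0-9A-Fa-f]+\z/;
    return hex($str);
}

printf "%d\n", parse_codepoint('0x20ac');   # 8364
printf "%d\n", oct('0x20ac');               # 8364, the same value
```

Unlike eval, a string that is not a hexadecimal number is rejected here instead of being run.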
      Thank you very much, graff and ikegami. I picked and chose from both your responses to create the script below. It reads the file CP1252.TXT on the Unicode.org Web site and, from it, generates the peculiar chart of broken Unicode characters I need for my "seemingly bizarre" purpose.

      It's actually not so bizarre. I'm helping diagnose a problem with a large system written in Visual Basic 6. It corrupts text -- lots of text. In addition to diagnosing the problem, I intend to use a Perl script to remediate as much of the damage done by the system as possible. The chart generated by the script below allows me to determine easily what damage I can and cannot repair.

      I needed your help. I was struggling with the bitwise operation to mimic the data corruption I'm modelling (& 0x7F) and I also needed guidance using oct, chr, ord, sprintf and Encode.
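The corruption model can also be applied to a whole string at once. The sketch below assumes the damage is exactly "every byte of the UTF-8 encoding loses its high bit"; the helper name strip_high_bits is hypothetical, and the string-wise & is an alternative to the per-byte map shown earlier in the thread.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Assumed model of the damage: each byte of the UTF-8 encoding has
# bit 7 cleared (byte & 0x7F).
sub strip_high_bits {
    my ($text) = @_;                       # a string of characters
    my $bytes = encode('UTF-8', $text);    # a string of bytes
    # When both operands are strings, & works byte by byte.
    return $bytes & ("\x7F" x length $bytes);
}

my $damaged = strip_high_bits("\x{20AC}5");    # EURO SIGN followed by "5"
print join(' ', map { sprintf '%02X', ord } split //, $damaged), "\n";
# E2 82 AC 35 becomes 62 02 2C 35; ASCII bytes pass through unchanged.
```

Because ASCII bytes are unaffected, only characters outside the ASCII range show up as damaged under this model.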

      A few notes:

      • You must use Encode rather than depend on the fact that Perl uses UTF-8 for its internal representation of strings. Both perlunitut and perlunifaq are adamant about this point. Now I understand why. As the documentation of chr explains: "[C]haracters from 128 to 255 (inclusive) are by default internally not encoded as UTF-8 for backward compatibility reasons."
      • At first, I used perl -CO to suppress the "Wide character in print..." warning message. Then I put binmode STDOUT, ':utf8'; into the script itself, which is much better.
      • I used sprintf('%08b', $_) in lieu of unpack('B8', pack('C', $_)). Happily, I didn't have to use pack or unpack at all.
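The three notes above can be sketched together as one chart row per code point. This is only an illustration under assumptions: the helper name chart_row and the ' | ' column separator are made up here, and the real script's layout may differ.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Stricter spelling of the ':utf8' layer; avoids "Wide character in print".
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: build one row of the chart for a code point.
sub chart_row {
    my ($codepoint) = @_;
    my $char = chr $codepoint;
    # Encode explicitly instead of peeking at Perl's internal representation.
    my @utf8 = map { ord } split //, encode('UTF-8', $char);
    my @nohi = map { $_ & 0x7F } @utf8;
    return join ' | ',
        $char,
        sprintf('U+%04X', $codepoint),
        join(' ', map { sprintf '%02X', $_ } @utf8),
        join(' ', map { sprintf '%08b', $_ } @utf8),   # no pack/unpack needed
        join(' ', map { sprintf '%02X', $_ } @nohi),
        join(' ', map { sprintf '%08b', $_ } @nohi);
}

print chart_row(0x20AC), "\n";
print chart_row(0xE9),   "\n";   # UTF-8 is C3 A9, two bytes, for a code point below 256
```

The second call is the case the chr documentation warns about: the code point is in the 128 to 255 range, so the UTF-8 bytes have to come from encode rather than from the string's internal storage.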

      Please feel free to critique my script. All suggestions for improvement are gladly welcome. Thanks!