
I've always found unpack() and bit manipulation confusing. Here's my variation on the theme that uses ord() and sprintf() instead of unpack(). This script takes advantage of the fact that Unicode::UCD::charinfo() returns undef for unassigned code points and non-characters.

#!perl

use strict;
use warnings;
use v5.12;

use Encode qw( encode );
use English qw( -no_match_vars );
use Unicode::UCD qw( charinfo );

binmode STDOUT, ':encoding(UTF-8)';

# Include the Unicode byte order mark...
print "\x{FEFF}";

local $OUTPUT_AUTOFLUSH        = 1;
local $OUTPUT_RECORD_SEPARATOR = "\n";
local $OUTPUT_FIELD_SEPARATOR  = "\t";

CODE:
for my $code (0x0000 .. 0x10FFFF) {
    # Don't complain about surrogate codes...
    no warnings qw( utf8 );

    my $charinfo = charinfo($code);

    # Skip unassigned code points and non-characters...
    next CODE unless defined $charinfo;

    my $codepoint = sprintf 'U+%06X', $code;
    my $character = chr $code;
    my $name      = $charinfo->{'name'};
    my $category  = $charinfo->{'category'};
    my $block     = $charinfo->{'block'};
    my $script    = $charinfo->{'script'};

    my @utf8_octets
        = map { ord }
          split m//, encode('UTF-8', $character);

    my $utf8_hexstring
        = join ' ',
          map { sprintf '%02X', $_ }
          @utf8_octets;

    my $utf8_binstring
        = join ' ',
          map { sprintf '%08b', $_ }
          @utf8_octets;

    # Don't try to print control, private use, or surrogate characters...
    $character = '' if $category eq 'Cc'
                    || $category eq 'Co'
                    || $category eq 'Cs';

    print $character,
          $code,
          $codepoint,
          $utf8_hexstring,
          $utf8_binstring,
          $name,
          $category,
          $block,
          $script;
}

exit 0;
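To see the core idea without looping over the whole code space, here is a minimal sketch for a single code point. The helper name utf8_strings is hypothetical (it is not part of the script above), and U+0378 is only assumed to be unassigned in the Unicode version bundled with your Perl.

```perl
use strict;
use warnings;

use Encode qw( encode );
use Unicode::UCD qw( charinfo );

# Hypothetical helper: return the UTF-8 octets of one code point
# as a hex string and as a binary string.
sub utf8_strings {
    my ($code) = @_;

    # Encode the character, then turn each resulting byte into its number.
    my @octets = map { ord } split m//, encode('UTF-8', chr $code);

    my $hex = join ' ', map { sprintf '%02X', $_ } @octets;
    my $bin = join ' ', map { sprintf '%08b', $_ } @octets;

    return ($hex, $bin);
}

# U+00E9 LATIN SMALL LETTER E WITH ACUTE
my ($hex, $bin) = utf8_strings(0x00E9);
print "$hex\n";    # C3 A9
print "$bin\n";    # 11000011 10101001

# charinfo() returns undef when the code point has no entry, which is
# what the script relies on to skip it. U+0378 is an assumption here.
for my $code (0x0041, 0x0378) {
    my $status = defined charinfo($code) ? 'assigned' : 'skipped';
    printf "U+%06X %s\n", $code, $status;
}
```

The sprintf formats do all the work: '%02X' gives two hex digits per octet and '%08b' gives eight binary digits per octet, so no bit manipulation is needed.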

Jim

Update: Here's a revised version of the script that handles surrogate code points more appropriately. And for comparison, I've used unpack('C*', ...). ☺

#!perl

use strict;
use warnings;
use v5.12;

use Encode qw( encode_utf8 );
use English qw( -no_match_vars );
use Unicode::UCD qw( charinfo );

binmode STDOUT, ':encoding(UTF-8)';

# Include a Unicode byte order mark in the output...
print "\x{FEFF}";

local $OUTPUT_AUTOFLUSH        = 1;
local $OUTPUT_RECORD_SEPARATOR = "\n";
local $OUTPUT_FIELD_SEPARATOR  = "\t";

CODE:
for my $code (0x000000 .. 0x10FFFF) {
    # Look up the code point in the Unicode Character Database...
    my $charinfo = charinfo($code);

    # Skip unassigned code points and non-characters...
    next CODE unless defined $charinfo;

    my $codepoint = sprintf 'U+%06X', $code;
    my $character = chr $code;
    my $name      = $charinfo->{'name'};
    my $category  = $charinfo->{'category'};
    my $block     = $charinfo->{'block'};
    my $script    = $charinfo->{'script'};

    my @utf8_octets
        = unpack 'C*', encode_utf8($character);

    my $utf8_hex_string
        = join ' ', map { sprintf '%02X', $ARG } @utf8_octets;

    my $utf8_bin_string
        = join ' ', map { sprintf '%08b', $ARG } @utf8_octets;

    # Don't try to print unprintable or private use characters...
    if ($category =~ m/^C[cfos]$/) {
        $character = '';

        # Don't falsely represent surrogates as valid UTF-8...
        if ($category eq 'Cs') {
            $utf8_hex_string = $utf8_bin_string = '';
        }
    }

    print $character,
          $code,
          $codepoint,
          $utf8_hex_string,
          $utf8_bin_string,
          $name,
          $category,
          $block,
          $script;
}

exit 0;
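For comparison, the two ways of getting at the octets can be put side by side. This is just a sketch using U+20AC EURO SIGN as an arbitrary example; the variable names are made up for illustration.

```perl
use strict;
use warnings;

use Encode qw( encode_utf8 );

# U+20AC EURO SIGN, encoded to a string of three bytes.
my $octets = encode_utf8("\x{20AC}");

# First script's approach: split into single bytes, then ord() each one.
my @via_ord = map { ord } split m//, $octets;

# Revised script's approach: unpack every byte as an unsigned char.
my @via_unpack = unpack 'C*', $octets;

print join(' ', map { sprintf '%02X', $_ } @via_ord),    "\n";    # E2 82 AC
print join(' ', map { sprintf '%02X', $_ } @via_unpack), "\n";    # E2 82 AC
```

On a byte string like this one, both produce the same list of numbers, so the choice between them is mostly a matter of which reads more clearly to you.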

Another update: I removed this…

    # Don't complain about surrogates...
    no warnings qw( surrogate );

…from the script because I realized it wasn't doing anything. The script already blanks out surrogates before printing them, so suppressing warnings about them isn't necessary.

In reply to Re: How to print the actual bytes of UTF-8 characters ? by Jim
in thread How to print the actual bytes of UTF-8 characters ? by RCH
