I've always found unpack() and bit manipulation confusing. Here's my variation on the theme that uses ord() and sprintf() instead of unpack(). This script takes advantage of the fact that Unicode::UCD::charinfo() returns undef for unassigned code points and non-characters.

    #!perl

    use strict;
    use warnings;
    use v5.12;

    use Encode qw( encode );
    use English qw( -no_match_vars );
    use Unicode::UCD qw( charinfo );

    binmode STDOUT, ':encoding(UTF-8)';

    # Include the Unicode byte order mark...
    print "\x{FEFF}";

    local $OUTPUT_AUTOFLUSH         = 1;
    local $OUTPUT_RECORD_SEPARATOR  = "\n";
    local $OUTPUT_FIELD_SEPARATOR   = "\t";

    CODE:
    for my $code (0x0000 .. 0x10FFFF) {

        # Don't complain about surrogate codes...
        no warnings qw( utf8 );

        my $charinfo = charinfo($code);

        # Skip unassigned code points and non-characters...
        next CODE unless defined $charinfo;

        my $codepoint = sprintf 'U+%06X', $code;
        my $character = chr $code;
        my $name      = $charinfo->{'name'};
        my $category  = $charinfo->{'category'};
        my $block     = $charinfo->{'block'};
        my $script    = $charinfo->{'script'};

        my @utf8_octets
            = map { ord } split m//, encode('UTF-8', $character);

        my $utf8_hexstring
            = join ' ', map { sprintf '%02X', $_ } @utf8_octets;

        my $utf8_binstring
            = join ' ', map { sprintf '%08b', $_ } @utf8_octets;

        # Don't try to print unprintable or private use characters...
        $character = ''
            if $category eq 'Cc'
            || $category eq 'Co'
            || $category eq 'Cs';

        print $character,
              $code,
              $codepoint,
              $utf8_hexstring,
              $utf8_binstring,
              $name,
              $category,
              $block,
              $script;
    }

    exit 0;

Jim

Update:  Here's a revised version of the script that handles surrogate code points more appropriately. And for comparison, I've used unpack('C*', ...). ☺

    #!perl

    use strict;
    use warnings;
    use v5.12;

    use Encode qw( encode_utf8 );
    use English qw( -no_match_vars );
    use Unicode::UCD qw( charinfo );

    binmode STDOUT, ':encoding(UTF-8)';

    # Include a Unicode byte order mark in the output...
    print "\x{FEFF}";

    local $OUTPUT_AUTOFLUSH         = 1;
    local $OUTPUT_RECORD_SEPARATOR  = "\n";
    local $OUTPUT_FIELD_SEPARATOR   = "\t";

    CODE:
    for my $code (0x000000 .. 0x10FFFF) {

        # Look up the code point in the Unicode Character Database...
        my $charinfo = charinfo($code);

        # Skip unassigned code points and non-characters...
        next CODE unless defined $charinfo;

        my $codepoint = sprintf 'U+%06X', $code;
        my $character = chr $code;
        my $name      = $charinfo->{'name'};
        my $category  = $charinfo->{'category'};
        my $block     = $charinfo->{'block'};
        my $script    = $charinfo->{'script'};

        my @utf8_octets = unpack 'C*', encode_utf8($character);

        my $utf8_hex_string
            = join ' ', map { sprintf '%02X', $ARG } @utf8_octets;

        my $utf8_bin_string
            = join ' ', map { sprintf '%08b', $ARG } @utf8_octets;

        # Don't try to print unprintable or private use characters...
        if ($category =~ m/^C[cfos]$/) {
            $character = '';

            # Don't falsely represent surrogates as valid UTF-8...
            if ($category eq 'Cs') {
                $utf8_hex_string = $utf8_bin_string = '';
            }
        }

        print $character,
              $code,
              $codepoint,
              $utf8_hex_string,
              $utf8_bin_string,
              $name,
              $category,
              $block,
              $script;
    }

    exit 0;
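For anyone comparing the two ways of getting at the octets, here is a minimal sketch that runs both on a single character. The helper names (octets_via_ord, octets_via_unpack, hex_string, bin_string) are made up for this illustration and don't appear in either script; I'm assuming encode_utf8() from Encode, as in the revised version.

```perl
#!perl

use strict;
use warnings;

use Encode qw( encode_utf8 );

# Hypothetical helper: octets via split and ord(), as in the first script.
sub octets_via_ord {
    my ($character) = @_;
    return map { ord($_) } split m//, encode_utf8($character);
}

# Hypothetical helper: octets via unpack('C*', ...), as in the revised script.
sub octets_via_unpack {
    my ($character) = @_;
    return unpack 'C*', encode_utf8($character);
}

# Format a list of octets as space-separated two-digit hex values.
sub hex_string {
    return join ' ', map { sprintf '%02X', $_ } @_;
}

# Format a list of octets as space-separated eight-digit binary values.
sub bin_string {
    return join ' ', map { sprintf '%08b', $_ } @_;
}

# U+263A WHITE SMILING FACE
my $character = chr 0x263A;

print hex_string(octets_via_ord($character)),    "\n";  # E2 98 BA
print bin_string(octets_via_unpack($character)), "\n";  # 11100010 10011000 10111010
```

Both helpers should return the same list for any character that encodes cleanly, so which one to use is mostly a matter of taste.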

Another update:  I removed this…

    # Don't complain about surrogates...
    no warnings qw( surrogate );

…from the script because I realized it wasn't doing anything. The script already skips printing surrogates further down, so suppressing warnings about them isn't necessary.


In reply to Re: How to print the actual bytes of UTF-8 characters ? by Jim
in thread How to print the actual bytes of UTF-8 characters ? by RCH
