comment on

Here's a Perl script that counts the number of bytes, code points, and graphemes in each UTF-8 encoded word. It also tallies the code points by Unicode blocks.

#!perl

use v5.14;
use strict;
use warnings;
use utf8;

use Encode qw( encode_utf8 );
use Unicode::UCD qw( charblock );

binmode STDOUT, ':encoding(UTF-8)';

while (my $word = <DATA>) {
    chomp $word;

    my $length_in_bytes       = length_in_bytes($word);
    my $length_in_code_points = length_in_code_points($word);
    my $length_in_graphemes   = length_in_graphemes($word);
    my $code_points_in_blocks = code_points_in_blocks($word);

    printf "%-12s | Bytes: %2d | Code Points: %2d | Graphemes: %2d | Blocks: %s\n",
        $word,
        $length_in_bytes,
        $length_in_code_points,
        $length_in_graphemes,
        $code_points_in_blocks;
}

exit 0;

sub length_in_bytes {
    my $word = shift;

    my $length = length encode_utf8($word);

    return $length;
}

sub length_in_code_points {
    my $word = shift;

    my $length = length $word;

    return $length;
}

sub length_in_graphemes {
    my $word = shift;

    my $length = () = $word =~ m/\X/g;

    return $length;
}

sub code_points_in_blocks {
    my $word = shift;

    my %total_code_points_by;
    my $blocks = '';

    for my $character (split m//, $word) {
        my $block = charblock(ord $character);

        $total_code_points_by{$block}++;
    }

    for my $block (sort keys %total_code_points_by) {
        my $total = $total_code_points_by{$block};

        $blocks .= sprintf "%s%s (%d)",
                   (length $blocks ? ', ' : ''), $block, $total;
    }

    return $blocks;
}

__DATA__
æ
æð
æða
æðaber
æðahnútur
æðakölkun
æðardúnn
æðarfugl
æðarkolla
æðarkóngur
æðarvarp
æði
æðimargur
æðisgenginn
æðiskast
æðislegur
æðrast
æðri
æðrulaus
æðruleysi
æðruorð
æðrutónn
æðstur
æður
æfa

Here's the output of the script.

æ            | Bytes:  2 | Code Points:  1 | Graphemes:  1 | Blocks: Latin-1 Supplement (1)
æð           | Bytes:  4 | Code Points:  2 | Graphemes:  2 | Blocks: Latin-1 Supplement (2)
æða          | Bytes:  5 | Code Points:  3 | Graphemes:  3 | Blocks: Basic Latin (1), Latin-1 Supplement (2)
æðaber       | Bytes:  8 | Code Points:  6 | Graphemes:  6 | Blocks: Basic Latin (4), Latin-1 Supplement (2)
æðahnútur    | Bytes: 12 | Code Points:  9 | Graphemes:  9 | Blocks: Basic Latin (6), Latin-1 Supplement (3)
æðakölkun    | Bytes: 12 | Code Points:  9 | Graphemes:  9 | Blocks: Basic Latin (6), Latin-1 Supplement (3)
æðardúnn     | Bytes: 11 | Code Points:  8 | Graphemes:  8 | Blocks: Basic Latin (5), Latin-1 Supplement (3)
æðarfugl     | Bytes: 10 | Code Points:  8 | Graphemes:  8 | Blocks: Basic Latin (6), Latin-1 Supplement (2)
æðarkolla    | Bytes: 11 | Code Points:  9 | Graphemes:  9 | Blocks: Basic Latin (7), Latin-1 Supplement (2)
æðarkóngur   | Bytes: 13 | Code Points: 10 | Graphemes: 10 | Blocks: Basic Latin (7), Latin-1 Supplement (3)
æðarvarp     | Bytes: 10 | Code Points:  8 | Graphemes:  8 | Blocks: Basic Latin (6), Latin-1 Supplement (2)
æði          | Bytes:  5 | Code Points:  3 | Graphemes:  3 | Blocks: Basic Latin (1), Latin-1 Supplement (2)
æðimargur    | Bytes: 11 | Code Points:  9 | Graphemes:  9 | Blocks: Basic Latin (7), Latin-1 Supplement (2)
æðisgenginn  | Bytes: 13 | Code Points: 11 | Graphemes: 11 | Blocks: Basic Latin (9), Latin-1 Supplement (2)
æðiskast     | Bytes: 10 | Code Points:  8 | Graphemes:  8 | Blocks: Basic Latin (6), Latin-1 Supplement (2)
æðislegur    | Bytes: 11 | Code Points:  9 | Graphemes:  9 | Blocks: Basic Latin (7), Latin-1 Supplement (2)
æðrast       | Bytes:  8 | Code Points:  6 | Graphemes:  6 | Blocks: Basic Latin (4), Latin-1 Supplement (2)
æðri         | Bytes:  6 | Code Points:  4 | Graphemes:  4 | Blocks: Basic Latin (2), Latin-1 Supplement (2)
æðrulaus     | Bytes: 10 | Code Points:  8 | Graphemes:  8 | Blocks: Basic Latin (6), Latin-1 Supplement (2)
æðruleysi    | Bytes: 11 | Code Points:  9 | Graphemes:  9 | Blocks: Basic Latin (7), Latin-1 Supplement (2)
æðruorð      | Bytes: 10 | Code Points:  7 | Graphemes:  7 | Blocks: Basic Latin (4), Latin-1 Supplement (3)
æðrutónn     | Bytes: 11 | Code Points:  8 | Graphemes:  8 | Blocks: Basic Latin (5), Latin-1 Supplement (3)
æðstur       | Bytes:  8 | Code Points:  6 | Graphemes:  6 | Blocks: Basic Latin (4), Latin-1 Supplement (2)
æður         | Bytes:  6 | Code Points:  4 | Graphemes:  4 | Blocks: Basic Latin (2), Latin-1 Supplement (2)
æfa          | Bytes:  4 | Code Points:  3 | Graphemes:  3 | Blocks: Basic Latin (2), Latin-1 Supplement (1)

UPDATE: If you add these three words to the end of the list in the __DATA__ block of the the UTF-8 encoded Perl script…

한국말
piñón
piñón

…then the report will include these three lines…

한국말          | Bytes:  9 | Code Points:  3 | Graphemes:  3 | Blocks: Hangul Syllables (3)
piñón        | Bytes:  7 | Code Points:  5 | Graphemes:  5 | Blocks: Basic Latin (3), Latin-1 Supplement (2)
piñón      | Bytes:  9 | Code Points:  7 | Graphemes:  5 | Blocks: Basic Latin (5), Combining Diacritical Marks (2)

In reply to Re: length() miscounting UTF8 characters? by Jim
in thread length() miscounting UTF8 characters? by Anonymous Monk

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.