in reply to contest problem about UTF-8

They basically want you to write your own UTF-8 decoder[1] that ignores errors.

Don't work with the hex or binary representations of the values. Work with the numbers directly.

You could store those numbers in an array.

# my @input = ( 0xC1, 0xB3, 0xE0, 0x81, ... );
my @input = do { local $/; unpack "C*", pack "H*", <> =~ s/\s//gr };

Or you could store those numbers in a string.

# my $input = "\xC1\xB3\xE0\x81...";
my $input = do { local $/; pack "H*", <> =~ s/\s//gr };

The latter allows us to search the numbers using the regex engine. The following matches an encoded sequence, and will allow one to find sequences of three or more encoded values in a row very easily:

(?: [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | ... )
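Here is a minimal sketch of how that pattern might be used to pull out runs of three or more encoded values. The 4- to 6-byte alternatives are my assumption about what belongs in the elided part, and find_runs is a hypothetical helper name, not something from the contest.

```perl
use strict;
use warnings;

# One encoded value: a lead byte followed by the matching number of
# continuation bytes. The last three rows are an assumption.
my $seq = qr/
      [\x00-\x7F]
    | [\xC0-\xDF][\x80-\xBF]
    | [\xE0-\xEF][\x80-\xBF]{2}
    | [\xF0-\xF7][\x80-\xBF]{3}
    | [\xF8-\xFB][\x80-\xBF]{4}
    | [\xFC-\xFD][\x80-\xBF]{5}
/x;

# Hypothetical helper: return every run of three or more encoded values
# found in a byte string.
sub find_runs {
    my ($input) = @_;
    my @runs;
    while ($input =~ /((?:$seq){3,})/g) {
        push @runs, $1;
    }
    return @runs;
}

# Sample input written as hex text, packed into a byte string first.
my @runs = find_runs(pack "H*", "41C1B3E08182FF42");
print scalar(@runs), " run(s) found\n";
```

Bytes that cannot start a value (a stray continuation byte, or 0xFF here) simply end the current run, which is one way of "ignoring errors".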

Once you've extracted the encoded values, it's just a question of decoding them.

my @bytes = unpack 'C*', $encoded_value;
if (@bytes == 1) {
    push @output, $bytes[0];
}
elsif (@bytes == 2) {
    # Five payload bits from the lead byte, six from the continuation byte.
    push @output, (( $bytes[0] & 0x1F ) << 6 ) | ( $bytes[1] & 0x3F );
}
elsif (@bytes == 3) {
    # Four payload bits from the lead byte, six from each continuation byte.
    push @output, (( $bytes[0] & 0x0F ) << 12 ) | (( $bytes[1] & 0x3F ) << 6 ) | ( $bytes[2] & 0x3F );
}
...
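The per-length branches all follow one rule, so they can probably be folded into a loop. This is only a sketch: decode_value is a hypothetical name, and it assumes the value passed in is already a well-formed sequence (one lead byte plus its continuation bytes).

```perl
use strict;
use warnings;

# Hypothetical helper: decode one encoded value (a byte string holding a
# lead byte plus its continuation bytes) into a number.
sub decode_value {
    my ($encoded_value) = @_;
    my ($lead, @rest) = unpack 'C*', $encoded_value;

    # A single byte stands for itself.
    return $lead if !@rest;

    # An n-byte sequence keeps (7 - n) payload bits in its lead byte.
    my $n     = 1 + @rest;
    my $value = $lead & ( 0x7F >> $n );

    # Every continuation byte contributes six more payload bits.
    $value = ( $value << 6 ) | ( $_ & 0x3F ) for @rest;
    return $value;
}

printf "%X\n", decode_value("\xC1\xB3");
```

For two bytes the mask works out to 0x1F and for three bytes to 0x0F, the same masks as in the explicit branches.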

Back to hex:

say join " ", map sprintf("%X", $_), @output;
or
# Separated by "." instead of spaces.
say sprintf "%vX", pack "W*", @output;

  1. Note that the encoding deviates from UTF-8 in a few respects:

    • UTF-8 is limited to encoding values up to 0x10FFFF.
    • UTF-8 is limited to four-byte sequences.
    • UTF-8 forbids using more bytes than necessary to encode a number.
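To make those three points concrete, here is a rough sketch that reports which of them a decoded value would violate under strict UTF-8. utf8_deviations is a hypothetical helper, and it covers only the three points listed, not every rule in the UTF-8 spec.

```perl
use strict;
use warnings;

# Hypothetical helper: given a decoded value and the number of bytes that
# were used to encode it, list which of the three rules it breaks.
sub utf8_deviations {
    my ($value, $length) = @_;
    my @problems;
    push @problems, 'above 0x10FFFF'         if $value > 0x10FFFF;
    push @problems, 'longer than four bytes' if $length > 4;

    # Smallest number of bytes able to hold the value in this scheme.
    my $needed = $value < 0x80      ? 1
               : $value < 0x800     ? 2
               : $value < 0x10000   ? 3
               : $value < 0x200000  ? 4
               : $value < 0x4000000 ? 5
               :                      6;
    push @problems, 'overlong' if $length > $needed;
    return @problems;
}

# 0xC1 0xB3 decodes to 0x73, which would fit in a single byte.
print join(", ", utf8_deviations(0x73, 2)), "\n";
```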

Re^2: contest problem about UTF-8
by Anonymous Monk on Oct 07, 2017 at 23:22 UTC
    UTF-8 is limited to encoding values up to 0x10FFFF.
    *currently

      They can always extend the spec, of course, but they've reserved a lot of space so they wouldn't need to do so. If anything, they've previously *reduced* the max to 0x10FFFF.