Re: contest problem about UTF-8

They basically want you to write your own UTF-8 decoder^[1] that ignores errors.

Don't work with the hex or binary representations of the values. Work with the numbers directly.

You could store those numbers in an array.

# my @input = ( 0xC1, 0xB3, 0xE0, 0x81, ... );
my @input = do { local $/; unpack "C*", pack "H*", <> =~ s/\s//gr };

Or you could store those numbers in a string.

# my $input = "\xC1\xB3\xE0\x81...";
my $input = do { local $/; pack "H*", <> =~ s/\s//gr };

The latter allows us to search the numbers using the regex engine. The following matches an encoded sequence, and will allow one to find sequences of three or more encoded values in a row very easily:

(?: [\x00-\x7F]
|   [\xC0-\xDF][\x80-\xBF]
|   [\xE0-\xEF][\x80-\xBF]{2}
|   ...
)
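
To make the "three or more in a row" part concrete, here is a minimal sketch. It covers only the one-, two- and three-byte forms shown above (the longer forms are left out), and `find_runs` and the sample input are made up for illustration.

```perl
use strict;
use warnings;

# Only the one-, two- and three-byte forms are covered here;
# the longer forms are deliberately left out of this sketch.
my $seq = qr/
    (?: [\x00-\x7F]
    |   [\xC0-\xDF][\x80-\xBF]
    |   [\xE0-\xEF][\x80-\xBF]{2}
    )
/x;

# Hypothetical helper: returns every run of three or more
# encoded values found in a string of bytes.
sub find_runs {
    my ($input) = @_;
    return $input =~ /((?:$seq){3,})/g;
}

# Hypothetical sample: a stray continuation byte, three encoded
# values in a row, then a byte that starts none of the forms above.
my @runs = find_runs("\x80\x41\xC1\xB3\xE0\x81\x81\xFF");
printf "%vX\n", $_ for @runs;    # prints 41.C1.B3.E0.81.81
```

Each element of `@runs` still holds several encoded values, so it would need to be split with the same pattern (for example `/$seq/g`) before decoding.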

Once you've extracted the encoded values, it's just a question of decoding them.

my @bytes = unpack 'C*', $encoded_value;

if    (@bytes == 1) { push @output, $bytes[0]; }
elsif (@bytes == 2) { push @output, (( $bytes[0] & 0x1F ) << 6 ) | ( $bytes[1] & 0x3F ); }
elsif (@bytes == 3) { push @output, (( $bytes[0] & 0x0F ) << 12 ) | (( $bytes[1] & 0x3F ) << 6 ) | ( $bytes[2] & 0x3F ); }
...
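
The branches all follow one pattern: the lead byte contributes its low payload bits, and every continuation byte contributes six more. A sketch of that as a loop, again limited to the one-, two- and three-byte forms; `decode_value` and `@lead_mask` are names invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: decodes one already-extracted encoded value.
# Handles only the one-, two- and three-byte forms.
sub decode_value {
    my ($encoded_value) = @_;
    my ($lead, @rest) = unpack 'C*', $encoded_value;

    # Payload mask for the lead byte, indexed by the number of
    # continuation bytes that follow it.
    my @lead_mask = (0x7F, 0x1F, 0x0F);
    die "unsupported length\n" if @rest > $#lead_mask;

    my $value = $lead & $lead_mask[@rest];

    # Each continuation byte carries six payload bits.
    $value = ($value << 6) | ($_ & 0x3F) for @rest;

    return $value;
}

# C1 B3: (0x01 << 6) | 0x33 = 0x73
printf "%X\n", decode_value("\xC1\xB3");        # prints 73

# E0 81 81: ((0x00 << 6 | 0x01) << 6) | 0x01 = 0x41
printf "%X\n", decode_value("\xE0\x81\x81");    # prints 41
```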

Back to hex:

say join " ", map sprintf("%X", $_), @output;

# Separated by "." instead of spaces.
say sprintf "%vX", pack "W*", @output;

Note that the encoding deviates from UTF-8 in a few respects:
- UTF-8 is limited to encoding values up to 0x10FFFF.
- UTF-8 is limited to four-byte sequences.
- UTF-8 forbids using more bytes than necessary to encode a number.
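
The last point is why an off-the-shelf decoder won't do here. As a rough illustration, assuming the core Encode module's strict "UTF-8" encoding with `FB_CROAK` as the check, an overlong sequence such as C1 B3 should be rejected even though it decodes to 0x73 by hand. `is_strict_utf8` is a name made up for this sketch.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: true if Encode's strict UTF-8 decoder
# accepts the given bytes, false if it croaks on them.
sub is_strict_utf8 {
    my ($bytes) = @_;    # work on a copy; decode() may modify its argument
    return defined eval { decode('UTF-8', $bytes, Encode::FB_CROAK) };
}

# 0x73 encoded in one byte, which is the shortest form.
print is_strict_utf8("\x73")     ? "accepted" : "rejected", "\n";

# 0x73 encoded in two bytes, which is more bytes than necessary.
print is_strict_utf8("\xC1\xB3") ? "accepted" : "rejected", "\n";
```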

Re^2: contest problem about UTF-8 by Anonymous Monk on Oct 07, 2017 at 23:22 UTC
UTF-8 is limited to encoding values up to 0x10FFFF. *currently
Re^3: contest problem about UTF-8 by ikegami (Patriarch) on Oct 08, 2017 at 02:09 UTC
They can always extend the spec, of course, but they've reserved a lot of space so they wouldn't need to do so. So far, the change has gone the other way: they've reduced the max to 0x10FFFF.