
They basically want you to write your own UTF-8 decoder^[1] that ignores errors.

Don't work with the hex or binary representations of the values. Work with the numbers directly.

You could store those numbers in an array.

# my @input = ( 0xC1, 0xB3, 0xE0, 0x81, ... );
my @input = do { local $/; unpack "C*", pack "H*", <> =~ s/\s//gr };

Or you could store those numbers in a string.

# my $input = "\xC1\xB3\xE0\x81...";
my $input = do { local $/; pack "H*", <> =~ s/\s//gr };
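
As a small check that the two forms hold the same numbers, here is a sketch that starts from a literal hex string instead of STDIN. The sample value and the names $hex and $digits are made up for illustration.

```perl
use strict;
use warnings;

my $hex = "C1 B3 E0\n81";            # hypothetical sample input

# Drop all whitespace, then turn each pair of hex digits into one byte.
(my $digits = $hex) =~ s/\s//g;
my $input = pack "H*", $digits;      # the numbers as a byte string
my @input = unpack "C*", $input;     # the same numbers as an array

printf "%d bytes: %s\n",
    length($input),
    join " ", map { sprintf "%02X", $_ } @input;
```

Either form can be converted to the other at any time with pack "C*" or unpack "C*", so the choice only depends on which is more convenient for the next step.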

The latter allows us to search the numbers using the regex engine. The following pattern (written for use with the /x flag) matches a single encoded value, which makes it very easy to find runs of three or more encoded values in a row:

(?: [\x00-\x7F]
|   [\xC0-\xDF][\x80-\xBF]
|   [\xE0-\xEF][\x80-\xBF]{2}
|   ...
)
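
One way the pattern might be used to pull out runs of three or more encoded values is sketched below. It covers only the one-, two- and three-byte forms spelled out above; the elided forms are left out. The helper name find_runs and the sample bytes are made up for illustration.

```perl
use strict;
use warnings;

# One encoded value. Only the forms spelled out above are listed; the
# longer forms elided in the post are left out of this sketch.
my $seq = qr/
    [\x00-\x7F]
  | [\xC0-\xDF][\x80-\xBF]
  | [\xE0-\xEF][\x80-\xBF]{2}
/x;

# find_runs is a hypothetical helper: it returns every run of three or
# more consecutive encoded values found in a byte string.
sub find_runs {
    my ($input) = @_;
    my @runs;
    push @runs, $1 while $input =~ /((?:$seq){3,})/g;
    return @runs;
}

# Made-up sample: a stray continuation byte, three encoded values,
# then a lead byte with nothing after it.
my @runs = find_runs("\x80\x41\xC3\xA9\xE2\x82\xAC\xC3");

# Split each run back into its individual encoded values.
for my $run (@runs) {
    my @values = $run =~ /($seq)/g;
    print join(" ", map { sprintf "%vX", $_ } @values), "\n";
}
```

Bytes that don't start a valid sequence simply fail to match, so the regex engine skips them, which is the "ignores errors" behaviour.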

Once you've extracted the encoded values, it's just a question of decoding them.

my @bytes = unpack 'C*', $encoded_value;

if    (@bytes == 1) { push @output, $bytes[0]; }
elsif (@bytes == 2) { push @output, (( $bytes[0] & 0x1F ) << 6 ) | ( $bytes[1] & 0x3F ); }
elsif (@bytes == 3) { push @output, (( $bytes[0] & 0x0F ) << 12 ) | (( $bytes[1] & 0x3F ) << 6 ) | ( $bytes[2] & 0x3F ); }
...
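
The per-length branches all follow one pattern, so they could be folded into a loop. The following is only a sketch: it assumes longer sequences continue the pattern of the cases shown (one fewer payload bit in the lead byte per extra byte, six bits per continuation byte), and decode_value is a hypothetical helper name.

```perl
use strict;
use warnings;

# Decode one already-extracted encoded value (a byte string) to a number.
sub decode_value {
    my ($encoded_value) = @_;
    my ($lead, @cont) = unpack 'C*', $encoded_value;

    # A single byte stands for itself.
    return $lead if !@cont;

    # The lead byte of an N-byte sequence keeps its low 7 - N bits:
    # 0x1F for two bytes, 0x0F for three, and so on.
    my $n     = 1 + @cont;
    my $value = $lead & ( 0x7F >> $n );

    # Each continuation byte contributes its low six bits.
    $value = ( $value << 6 ) | ( $_ & 0x3F ) for @cont;

    return $value;
}

printf "%X\n", decode_value($_) for "\x41", "\xC3\xA9", "\xE2\x82\xAC";
```

No validation is done here, which is intentional: an overlong sequence such as C1 B3 simply decodes to its payload value.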

Back to hex:

say join " ", map sprintf("%X", $_), @output;

# Separated by "." instead of spaces.
say sprintf "%vX", pack "W*", @output;

Note that the problem's encoding deviates from UTF-8 in a few respects, since UTF-8 imposes restrictions that the problem apparently does not:
- UTF-8 is limited to encoding values up to 0x10FFFF.
- UTF-8 is limited to four-byte sequences.
- UTF-8 forbids using more bytes than necessary to encode a number.

In reply to Re: contest problem about UTF-8 by ikegami
in thread contest problem about UTF-8 by rsFalse
