in reply to contest problem about UTF-8
They basically want you to write your own UTF-8 decoder[1] that ignores errors.
Don't work with the hex or binary representations of the values. Work with the numbers directly.
You could store those numbers in an array.
    # my @input = ( 0xC1, 0xB3, 0xE0, 0x81, ... );
    my @input = do { local $/; unpack "C*", pack "H*", <> =~ s/\s//gr };
Or you could store those numbers in a string.
    # my $input = "\xC1\xB3\xE0\x81...";
    my $input = do { local $/; pack "H*", <> =~ s/\s//gr };
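To make the two representations concrete, here is a minimal sketch that builds both from a hard-coded hex string instead of reading from `<>`. The sample bytes in `$hex` are made up for illustration and are not taken from the contest input.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical sample input; the real data would be read from <>.
my $hex = "C1 B3 E0 81 B3 41";

# Strip whitespace, then turn each pair of hex digits into one byte.
my $input = pack "H*", $hex =~ s/\s//gr;

# The same numbers as a list of integers, one per byte.
my @input = unpack "C*", $input;

say length $input;       # 6
say join ",", @input;    # 193,179,224,129,179,65
```

The string form and the array form hold the same numbers, so you can convert between them with `pack "C*"` and `unpack "C*"` whenever one is more convenient.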
The latter allows us to search the numbers using the regex engine. The following matches an encoded sequence, and will allow one to find sequences of three or more encoded values in a row very easily:
(?: [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | ... )
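As a rough sketch of how that pattern could be used to find runs, the following keeps only the 1-, 2- and 3-byte alternatives shown above (leaving out the longer forms is my simplification) and searches a made-up input string. The names `$seq` and `@runs` are just for illustration.

```perl
use strict;
use warnings;
use feature 'say';

# One encoded value; only the 1-, 2- and 3-byte forms are covered here.
my $seq = qr/
    [\x00-\x7F]
  | [\xC0-\xDF][\x80-\xBF]
  | [\xE0-\xEF][\x80-\xBF]{2}
/x;

# Hypothetical input: a stray continuation byte, three encoded values
# in a row (2 bytes, 3 bytes, 1 byte), then another stray byte.
my $input = "\x80\xC1\xB3\xE0\x81\xB3\x41\xBF";

# Each match is a run of at least three encoded values in a row.
my @runs = $input =~ /((?:$seq){3,})/g;

say unpack "H*", $_ for @runs;    # c1b3e081b341
```

Note the `/x` flag on the pattern: without it, the spaces used for readability would be matched literally.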
Once you've extracted the encoded values, it's just a question of decoding them.
    my @bytes = unpack 'C*', $encoded_value;
    if (@bytes == 1) {
        push @output, $bytes[0];
    }
    elsif (@bytes == 2) {
        # 5 bits from the lead byte, 6 bits from the continuation byte.
        push @output, (( $bytes[0] & 0x1F ) << 6 ) | ( $bytes[1] & 0x3F );
    }
    elsif (@bytes == 3) {
        # 4 bits from the lead byte, 6 bits from each continuation byte.
        push @output, (( $bytes[0] & 0x0F ) << 12 ) | (( $bytes[1] & 0x3F ) << 6 ) | ( $bytes[2] & 0x3F );
    }
    ...
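The same arithmetic can be folded into a loop. This is only a sketch for the 1-, 2- and 3-byte forms, and `decode_value` is a name I made up. Like the snippet above, it deliberately does not reject overlong forms.

```perl
use strict;
use warnings;
use feature 'say';

# Decode one encoded value of 1 to 3 bytes into a number.
sub decode_value {
    my @bytes = unpack 'C*', shift;
    return $bytes[0] if @bytes == 1;

    # The lead byte carries 5 payload bits in the 2-byte form (mask 0x1F)
    # and 4 payload bits in the 3-byte form (mask 0x0F).
    my $value = $bytes[0] & ( 0x7F >> scalar @bytes );

    # Each continuation byte contributes its low 6 bits.
    $value = ( $value << 6 ) | ( $_ & 0x3F ) for @bytes[ 1 .. $#bytes ];

    return $value;
}

say sprintf "%X", decode_value("\x41");            # 41
say sprintf "%X", decode_value("\xC1\xB3");        # 73 (overlong form)
say sprintf "%X", decode_value("\xE0\x81\xB3");    # 73 (overlong form)
say sprintf "%X", decode_value("\xE2\x82\xAC");    # 20AC
```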
Back to hex:
    say join " ", map sprintf("%X", $_), @output;

or

    # Separated by "." instead of spaces.
    say sprintf "%vX", pack "W*", @output;
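For comparison, here is a small sketch of both output styles applied to a made-up list of decoded values. The contents of `@output` are hypothetical.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical decoded values.
my @output = ( 0x73, 0x20AC, 0x41 );

# One hex number per value, separated by spaces.
my $spaced = join " ", map { sprintf "%X", $_ } @output;

# pack "W*" builds a string with one character per value; the "v" flag
# then formats each character's ordinal, joined by ".".
my $dotted = sprintf "%vX", pack "W*", @output;

say $spaced;    # 73 20AC 41
say $dotted;    # 73.20AC.41
```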
Note that the encoding deviates from UTF-8 in a few respects:
Re^2: contest problem about UTF-8
by Anonymous Monk on Oct 07, 2017 at 23:22 UTC
by ikegami (Patriarch) on Oct 08, 2017 at 02:09 UTC