contest problem about UTF-8

rsFalse has asked for the wisdom of the Perl Monks concerning the following question:

Hello.

I was trying to solve a problem about UTF-8, and went with solution which is too slow. I want to ask suggestions how to optimize my code, or suggestions about better approaches. Maybe with using pack/unpack, but I've found perlpacktut not very clear for me.

I have no idea how to write a problem statement from .pdf paper, so I made some screenshots.
How would you solve this problem?

Test case #1:

C1 B3 E0 81 B3 F0 80 81 B3
F8 80 80 81 B3 FC 80 80 80
81 B3 80 C1 B3 E0 81 B3 F0
80 81 B3 F8 80 80 81 B3 FC 80
80 80 81 B3 E0 AF B5
[download]

Test case #2:

FF 00 FF D1 81 F0 80 91 81
00 D1 81 FF F0 80 91 81 FF
FF 00 D1 81 F0 80 91 81
[download]

Output for TC #1:

73 73 73 73 73
73 73 73 73 73 BF5
[download]

Output for TC #2:

441 441 0 441
0 441 441
[download]

My code which solves two tests correctly, but it is slow for bigger tests (can't see them), and maybe it is incorrect(?):

#!/usr/bin/perl

use warnings;
use strict;

$\ = $/;
my $debug = 0;

$_ = do { local $/; <> };

@_ = split;

$_ = join '', map { sprintf "%08b", eval "0x$_" } @_, 'FF';

my @data = reverse <DATA>;
chomp @data;

y/x/./ for @data;

my @rxs = join '|', @data;

my @r;
my @R;

while( /\G(?:@rxs|.{8})/g ){
    my $c = $&;
    $debug and print "c:$c";
    
    $c =~ /@rxs/ or do {
        push @R, [ @r ] if @r > 2;
        @r = ();
        $debug and print "   -F";
        next;
        };
    
    my $x = $c =~ s/.{8}/ ($&) =~ s!^1+0!!r /ger;
    $debug and print "x:",$x;
    
    push @r, ~~ reverse $x;
    }

my @acc;
my @ACC;

for my $R (@R){
    
    @acc = ();
    for my $r ( @{ $R } ){
        
        my $acc = '';
        
        while( $r =~ /.{1,4}/g ){
            $acc .= ( 0 .. 9, 'A' .. 'F' )[ eval "0b" . reverse $& ];
            }
        
        $acc = reverse $acc;
        $acc =~ s/^0+\B//;
        
        push @acc, $acc;
        }
        
    push @ACC, "@acc";
}

print for @ACC;

__DATA__
0xxxxxxx
110xxxxx10xxxxxx
1110xxxx10xxxxxx10xxxxxx
11110xxx10xxxxxx10xxxxxx10xxxxxx
111110xx10xxxxxx10xxxxxx10xxxxxx10xxxxxx
1111110x10xxxxxx10xxxxxx10xxxxxx10xxxxxx10xxxxxx
[download]

Comment on contest problem about UTF-8 Select or Download Code

Replies are listed 'Best First'.
Re: contest problem about UTF-8 by ikegami (Patriarch) on Oct 07, 2017 at 01:15 UTC
They basically want you to write your own UTF-8 decoder^[1] that ignores errors. Don't work with the hex or binary representations of the values. Work with the numbers directly. You could store those numbers in an array. `# my @input = ( 0xC1, 0xB3, 0xE0, 0x81, ... ); my @input = do { local $/; unpack "C", pack "H", <> =~ s/\s//gr };` [download] Or you could store those numbers in a string. `# my $input = "\xC1\xB3\xE0\x81..."; my $input = do { local $/; pack "H", <> =~ s/\s//gr };` [download] The latter allows us to search the numbers using the regex engine. The following matches an encoded sequence, and will allow one to find sequences of three or more encoded values in a row very easily: `(?: [\x00-\x7F] \| [\xC0-\xDF][\x80-\xBF] \| [\xE0-\xEF][\x80-\xBF]{2} \| ... )` [download] Once you've extracted the encoded values, it's just a question of decoding them. `my @bytes = unpack 'C', $encoded_value; if (@bytes == 1) { push @output, $bytes[0]; } elsif (@bytes == 2) { push @output, (( $bytes[0] & 0x1F ) << 6 ) \| ( $ +bytes[0] & 0x3F ); } elsif (@bytes == 3) { push @output, (( $bytes[0] & 0x0F ) << 12 ) \| (( + $bytes[0] & 0x3F ) << 6 ) \| ( $bytes[0] & 0x3F ); } ...` [download] Back to hex: `say join " ", map sprintf("%X", $_), @output;` [download] or `# Separated by "." instead of spaces. say sprintf "%vX", pack "W*", @output;` [download] Note that the encoding deviates from UTF-8 in a few respects: UTF-8 is limited to encoding values up to 0x10FFFF. UTF-8 is limited to four-bytes sequences. UTF-8 forbids using more bytes than necessary to encode a number.	[reply] [d/l] [select]
Re^2: contest problem about UTF-8 by Anonymous Monk on Oct 07, 2017 at 23:22 UTC
UTF-8 is limited to encoding values up to 0x10FFF. *currently	[reply]
Re^3: contest problem about UTF-8 by ikegami (Patriarch) on Oct 08, 2017 at 02:09 UTC
They can always extend the spec, of course, but they've reserved of lot of space so they wouldn't need to do so. Before now, they've reduced the max to 0x10FFFF.	[reply]
Re: contest problem about UTF-8 by LanX (Saint) on Oct 07, 2017 at 00:55 UTC
Does the "Open Cup named after E.V. Pankratiev - Grand Prix Eurasia" allow external help? :) Cheers Rolf _{(addicted to the Perl Programming Language and ☆☆☆☆ :) Je suis Charlie!}	[reply]
Re^2: contest problem about UTF-8 by rsFalse (Chaplain) on Oct 07, 2017 at 11:51 UTC
Contest is finished now. It was held on October 1st, 5 hours long.	[reply]