Re: Processing an encoded file backwards

I've actually thought about this myself. seek, tell, sysseek, and sysread all operate on bytes, while read operates on bytes or characters depending on the I/O layers. Because we can only seek in bytes, I think the only way to approach it is to first read a chunk of bytes from the end of the file, and then look at what was read to determine whether a UTF-8 encoded character was chopped off. Specifically, check whether the block of data begins with bytes whose two high bits are 10 (that is, 10xxxxxx), since those are UTF-8 continuation bytes. Discard those bytes (they are picked up again by the next read, which ends where the kept data begins), and you should then have a buffer that can be correctly decoded as UTF-8 and that you can inspect for how many characters it contains, how many lines, etc., depending on what unit you actually want your window to be counted in. So I took this opportunity to finally express my idea in code :-)

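sub readbackwards_utf8 { # returns an iterator
    my ($fn, $window) = @_;
    # a UTF-8 character is at most 4 bytes, so every window must be able to hold one
    die "Bad window $window" unless $window>=4;
    open my $fh, '<:raw', $fn or die "open $fn: $!";
    my $curpos = -s $fh; # byte offset at which the previously returned chunk began
    return sub {
        if ( $curpos<1 ) { close $fh if $fh; $fh=undef; return }
        my $bytes = $curpos < $window ? $curpos : $window;
        seek($fh, $curpos-=$bytes, 0) or die "seek $curpos $fn: $!";
        read($fh, my $buf, $bytes) == $bytes
            or die "read $bytes bytes at $curpos from $fn: $!";
        # skip leading continuation bytes (10xxxxxx); they belong to a
        # character that starts earlier in the file and is read next time
        while ( (ord(substr $buf, 0, 1) & 0b11000000)==0b10000000 )
            { $buf = substr $buf, 1; $curpos++ }
        utf8::decode($buf) or die "invalid UTF-8 near byte $curpos in $fn";
        return $buf;
    }
}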

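It would be pretty easy to wrap the iterator which the above code returns into another iterator that counts characters and lines, and returns chunks of that size. Of course, this is specific to UTF-8. For an encoding with a fixed width, like UTF-32, it would be somewhat easier, since you only need to keep the reads aligned to the character width. UTF-16 is not quite fixed-width because of surrogate pairs, but it should be nearly as simple: a chunk read from an even offset can only have been cut in the middle of a surrogate pair.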
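As a rough sketch of such a wrapper (not part of the code above): the hypothetical lines_backwards below takes any iterator that returns decoded chunks from the end of the data towards the start and undef when exhausted, which is the contract of the iterator returned by readbackwards_utf8, and turns it into an iterator over whole lines, last line first. It assumes "\n" line endings and buffers at most one chunk plus one partial line at a time.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: wraps an iterator that
# returns decoded chunks back to front and yields complete lines,
# last line first, then undef.
sub lines_backwards {
    my ($next_chunk) = @_;
    my $pending = '';   # text whose first line may still be incomplete
    my @lines;          # complete lines, in file order
    my $done = 0;
    return sub {
        while ( !@lines && !$done ) {
            my $chunk = $next_chunk->();
            if ( defined $chunk ) {
                $pending = $chunk . $pending;
                # split after every newline, keeping the newlines
                my @parts = split /(?<=\n)/, $pending;
                # the first part may continue into an earlier chunk
                $pending = @parts ? shift @parts : '';
                @lines = @parts;
            }
            else {
                # start of data reached: what is left is the first line
                $done = 1;
                @lines = ($pending) if length $pending;
                $pending = '';
            }
        }
        return pop @lines;   # undef once everything has been returned
    };
}

# Usage with an in-memory stand-in for the chunk iterator:
my @chunks = ("ree\n", "o\nth", "e\ntw", "on");
my $next_line = lines_backwards( sub { shift @chunks } );
while ( defined( my $line = $next_line->() ) ) {
    print $line;
}
```

With the real thing, the call would presumably look like lines_backwards( readbackwards_utf8($filename, 4096) ); counting characters instead of lines would follow the same pattern with length in place of the split.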

use open qw/:std :utf8/;
use Test::More;
use File::Temp qw/tempfile/;

my ($tempfh, $filename) = tempfile( UNLINK => 1 );
binmode $tempfh, ':encoding(UTF-8)';
print $tempfh "H\N{U+20AC}ll\N{U+00F6}, \N{U+1F5FA}!\n";
close $tempfh;
#system('hexdump','-C',$filename);

my $four = readbackwards_utf8($filename, 4);
is $four->(), "!\n";
is $four->(), "\N{U+1F5FA}";
is $four->(), "\N{U+00F6}, ";
is $four->(), "ll";
is $four->(), "H\N{U+20AC}";
is $four->(), undef;
is $four->(), undef;

my $five = readbackwards_utf8($filename, 5);
is $five->(), "!\n";
is $five->(), " \N{U+1F5FA}";
is $five->(), "ll\N{U+00F6},";
is $five->(), "H\N{U+20AC}";
is $five->(), undef;

my $six = readbackwards_utf8($filename, 6);
is $six->(), "\N{U+1F5FA}!\n";
is $six->(), "ll\N{U+00F6}, ";
is $six->(), "H\N{U+20AC}";
is $six->(), undef;

my $seven = readbackwards_utf8($filename, 7);
is $seven->(), " \N{U+1F5FA}!\n";
is $seven->(), "ll\N{U+00F6},";
is $seven->(), "H\N{U+20AC}";
is $seven->(), undef;

for my $n (8..9) {
    my $eight = readbackwards_utf8($filename, $n);
    is $eight->(), ", \N{U+1F5FA}!\n";
    is $eight->(), "H\N{U+20AC}ll\N{U+00F6}";
    is $eight->(), undef;
}

my $ten = readbackwards_utf8($filename, 10);
is $ten->(), "\N{U+00F6}, \N{U+1F5FA}!\n";
is $ten->(), "H\N{U+20AC}ll";
is $ten->(), undef;

my $eleven = readbackwards_utf8($filename, 11);
is $eleven->(), "l\N{U+00F6}, \N{U+1F5FA}!\n";
is $eleven->(), "H\N{U+20AC}l";
is $eleven->(), undef;

for my $n (12..14) {
    my $twelve = readbackwards_utf8($filename, $n);
    is $twelve->(), "ll\N{U+00F6}, \N{U+1F5FA}!\n";
    is $twelve->(), "H\N{U+20AC}";
    is $twelve->(), undef;
}

my $fifteen = readbackwards_utf8($filename, 15);
is $fifteen->(), "\N{U+20AC}ll\N{U+00F6}, \N{U+1F5FA}!\n";
is $fifteen->(), "H";
is $fifteen->(), undef;

for my $n (16..17) {
    my $sixteen = readbackwards_utf8($filename, $n);
    is $sixteen->(), "H\N{U+20AC}ll\N{U+00F6}, \N{U+1F5FA}!\n";
    is $sixteen->(), undef;
}

done_testing;

Re^2: Processing an encoded file backwards by LanX (Saint) on Jan 18, 2020 at 20:41 UTC
Sure, this is the basic approach for UTF-8. I was hoping for a more elegant and generic solution using Encode.

As you can see in my demo in the other answer, Encode uses "\x{FFFD}" to decode the broken character. If it's reliable° in doing so, this could lead to better code.

Not sure what other multi-byte encodings are out there...

Cheers Rolf

°) It is; from Encode:

> If CHECK is 0, encoding and decoding replace any malformed character with a substitution character. When you encode, SUBCHAR is used. When you decode, the Unicode REPLACEMENT CHARACTER, code point U+FFFD, is used. If the data is supposed to be UTF-8, an optional lexical warning of warning category "utf8" is given.

Update: The flag FB_QUIET seems to be the answer.
Re^3: Processing an encoded file backwards by haukex (Archbishop) on Jan 18, 2020 at 20:56 UTC
> As you can see in my demo in the other answer is Encode using "\x{FFFD}" to decode the broken character. When it's reliable° in doing so, this could lead to better code.

Well, to be purist about it (emphasis mine):

> If CHECK is 0, encoding and decoding replace *any* malformed character with a substitution character.

So it doesn't allow you to differentiate between a character that was broken by the read, and an actually malformed input file.

Update:

> Not sure what other multi-byte encodings are out there...

Me neither, but I think UTF-8 and UTF-16 would already cover a lot of what's out there today.

> As you can see in my demo in the other answer

I don't use the debugger often, so reading its output doesn't come naturally to me ;-)
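For the UTF-16 case mentioned here, the same trimming idea might look roughly like the sketch below. It is only an illustration under stated assumptions: the data is UTF-16LE without relying on a BOM, the chunk was read from an even byte offset, and the helper name trim_and_decode_utf16le is made up for this example. A chunk that starts with a low surrogate (0xDC00 to 0xDFFF) has been cut in the middle of a surrogate pair, so that one code unit is skipped and left for the next, earlier read. Using Encode::FB_CROAK means that genuinely malformed input still dies instead of being silently replaced.

```perl
use strict;
use warnings;
use Encode qw/decode encode/;

# Hypothetical helper, for illustration only: $buf holds raw UTF-16LE
# bytes read from an even file offset. Returns the decoded string and
# the number of leading bytes that were skipped (0 or 2).
sub trim_and_decode_utf16le {
    my ($buf) = @_;
    my $skipped = 0;
    if ( length($buf) >= 2 ) {
        my $unit = unpack 'v', $buf;   # first 16-bit unit, little-endian
        if ( $unit >= 0xDC00 && $unit <= 0xDFFF ) {
            # low surrogate: second half of a character split by the read
            $buf     = substr $buf, 2;
            $skipped = 2;
        }
    }
    return ( decode('UTF-16LE', $buf, Encode::FB_CROAK), $skipped );
}

# Usage: cut a buffer in the middle of the surrogate pair for U+1F5FA
my $bytes = encode('UTF-16LE', "\x{1F5FA}!\n");
my ($text, $skipped) = trim_and_decode_utf16le( substr $bytes, 2 );
printf "skipped %d bytes, got %d characters\n", $skipped, length $text;
```

In a backwards reader the caller would add the skipped byte count back onto its current position, just like the $curpos++ in the UTF-8 version.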
Re^4: Processing an encoded file backwards (updated) by LanX (Saint) on Jan 18, 2020 at 21:16 UTC
> So it doesn't allow you to differentiate between a character that was broken by the read, and an actually malformed input file.

Providing a callback helps identify the malformed bytes:

    DB<131> dd $rr
    "\x84\xC3\x96\xC3\x9C.\r\n\r\n"
    DB<132> $start=0
    DB<133> $rru = Encode::decode('utf8',$rr, sub{ my $broken = shift; $start++; "" });
    DB<134> dd $rru
    "\xD6\xDC.\r\n\r\n"
    DB<135> p $start
    1

> I don't use the debugger often, so reading its output doesn't come naturally to me ;-)

As commented, pp/dd are from Data::Dump and Dump is from Devel::Peek. Furthermore, among the debugger commands, p prints a scalar and x prints a list:

    DB<79> h p
    p expr          Same as "print {DB::OUT} expr" in current package.
    DB<80> h x
    x expr          Evals expression in list context, dumps the result.

Update: One way to identify how many malformed bytes are at the start, and to be sure the rest is well-formed:

    DB<159> $start=0
    DB<160> $rru = Encode::decode('utf8',$rr,sub{ $start++; return "" });
    DB<161> $sub= substr $rr,$start
    DB<162> $rru2 = Encode::decode('utf8',$sub ,Encode::FB_CROAK);
    DB<163> p $rru2 eq $rru
    1

Cheers Rolf
Re^5: Processing an encoded file backwards by haukex (Archbishop) on Jan 18, 2020 at 21:53 UTC
Re^6: Processing an encoded file backwards (updated) by LanX (Saint) on Jan 18, 2020 at 22:30 UTC