in reply to Re: Processing an encoded file backwards
in thread Processing an encoded file backwards

Sure, this is the basic approach for UTF-8.

I was hoping for a more elegant and generic solution using Encode.

As you can see in my demo in the other answer, Encode uses "\x{FFFD}" when decoding the broken character.

If it's reliable° in doing so, this could lead to better code.

Not sure what other multi-byte encodings are out there...

Cheers Rolf
(addicted to the Perl Programming Language :)

°) It is; from the Encode documentation: "If CHECK is 0, encoding and decoding replace any malformed character with a substitution character. When you encode, SUBCHAR is used. When you decode, the Unicode REPLACEMENT CHARACTER, code point U+FFFD, is used. If the data is supposed to be UTF-8, an optional lexical warning of warning category "utf8" is given."
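A quick demonstration of that paragraph (my own two-liner, not from the post):

  use strict;
  use warnings;           # enables the lexical "utf8" warning category
  use Encode qw/decode/;

  # A lone \x96 is malformed UTF-8: with the default CHECK of 0 it is
  # replaced by U+FFFD, and a warning in category "utf8" is emitted.
  my $chars = decode('UTF-8', "\x96");
  printf "U+%04X\n", ord $chars;   # prints U+FFFD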

Update:

The flag FB_QUIET seems to be the answer.
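A minimal sketch of what that could look like (the example buffer and the skip-and-retry loop are my own illustration, assuming the only damage is at the start of the chunk): with FB_QUIET, decode() returns the decodable prefix and leaves the unprocessed bytes in the source variable, so a chunk that starts mid-character can be trimmed until the remainder decodes cleanly.

  use strict;
  use warnings;
  use Encode ();

  # Chunk that starts mid-character: \x84 is the tail of a multi-byte
  # sequence whose lead byte ended up in the previous (earlier) chunk.
  my $bytes = "\x84\xC3\x96\xC3\x9C";

  my $skipped = 0;
  my $chars   = '';
  while (length $bytes) {
      my $try = $bytes;
      # FB_QUIET: return what decoded so far, leave the rest in $try.
      $chars = Encode::decode('UTF-8', $try, Encode::FB_QUIET);
      last if $try eq '';            # whole remainder decoded cleanly
      substr($bytes, 0, 1) = '';     # drop one leading byte and retry
      $skipped++;
  }
  # Now $chars is "\x{D6}\x{DC}" ("ÖÜ") and $skipped is 1.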


Re^3: Processing an encoded file backwards
by haukex (Archbishop) on Jan 18, 2020 at 20:56 UTC
    As you can see in my demo in the other answer, Encode uses "\x{FFFD}" when decoding the broken character. If it's reliable° in doing so, this could lead to better code.

    Well, to be a purist about it (emphasis mine):

    If CHECK is 0, encoding and decoding replace *any malformed character* with a substitution character.

    So it doesn't allow you to differentiate between a character that was broken by the read, and an actually malformed input file.
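
    To illustrate (this comparison is my own, not from the post): a character chopped in half by the read and a byte that can never occur in valid UTF-8 both come back as the same U+FFFD, so nothing in the decoded output tells the two cases apart.

    use strict;
    use warnings;
    no warnings 'utf8';    # silence the optional "utf8" category warning
    use Encode qw/decode/;
    use Data::Dump qw/dd/;

    # "\xC3\x96" is "Ö"; chopping off its lead byte leaves a stray "\x96".
    dd decode('UTF-8', "\x96\xC3\x9C");   # character broken by the read
    dd decode('UTF-8', "\xFF\xC3\x9C");   # \xFF is never valid UTF-8
    # Both calls print the same string: U+FFFD followed by U+00DC.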

    Update:

    Not sure what other multi-byte encodings are out there...

    Me neither, but I think UTF-8 and UTF-16 would already cover a lot of what's out there today.

    As you can see in my demo in the other answer

    I don't use the debugger often, so reading its output doesn't come naturally to me ;-)

      > So it doesn't allow you to differentiate between a character that was broken by the read, and an actually malformed input file.

      Providing a callback helps identify the malformed bytes:

      DB<131> dd $rr
      "\x84\xC3\x96\xC3\x9C.\r\n\r\n"
      DB<132> $start=0
      DB<133> $rru = Encode::decode('utf8',$rr, sub{ my $broken = shift; $start++; "" });
      DB<134> dd $rru
      "\xD6\xDC.\r\n\r\n"
      DB<135> p $start
      1
      DB<136>

      > I don't use the debugger often, so reading its output doesn't come naturally to me ;-)

      As commented above.

      Furthermore, the debugger commands used are:

      • p prints scalar
      • x prints list

      DB<79> h p
      p expr        Same as "print {DB::OUT} expr" in current package.
      DB<80> h x
      x expr        Evals expression in list context, dumps the result.

      Update:

      One way to identify how many malformed bytes are at the start, and to be sure the rest is well-formed:

      DB<159> $start=0
      DB<160> $rru = Encode::decode('utf8',$rr,sub{ $start++; return "" });
      DB<161> $sub= substr $rr,$start
      DB<162> $rru2 = Encode::decode('utf8',$sub ,Encode::FB_CROAK);
      DB<163> p $rru2 eq $rru
      1
      DB<164>
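
      The same check as a standalone script instead of a debugger session (my transcription of the session above):

      use strict;
      use warnings;
      use Encode qw/decode/;

      my $rr = "\x84\xC3\x96\xC3\x9C.\r\n\r\n";   # buffer from the session above

      # Count the malformed bytes via the callback, which replaces each
      # of them with the empty string and leaves $rr itself untouched.
      my $start = 0;
      my $rru   = decode('utf8', $rr, sub { $start++; return "" });

      # Skip the malformed prefix and re-decode strictly: FB_CROAK dies
      # if anything in the remainder is still malformed.
      my $sub  = substr $rr, $start;
      my $rru2 = decode('utf8', $sub, Encode::FB_CROAK);

      print $rru2 eq $rru ? 1 : 0, "\n";   # prints 1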

      Cheers Rolf
      (addicted to the Perl Programming Language :)

        The flag FB_QUIET seems to be the answer.
        Providing a callback helps identify the malformed bytes.

        Unfortunately, that doesn't always seem to be the case:

        use warnings;
        use strict;
        use Encode qw/decode/;
        use Data::Dump;
        dd decode('UTF-16-BE', "\xD8\x3D\xDD\xFA", Encode::FB_QUIET|Encode::LEAVE_SRC);
        dd decode('UTF-16-BE', "\x3D\xDD\xFA", Encode::FB_QUIET|Encode::LEAVE_SRC);
        dd decode('UTF-16-BE', "\x3D\xDD\xFA", sub{ sprintf "<U+%04X>", shift });
        dd decode('UTF-16-LE', "\x3D\xD8\xFA\xDD", Encode::FB_QUIET|Encode::LEAVE_SRC);
        dd decode('UTF-16-LE', "\xD8\xFA\xDD", Encode::FB_QUIET|Encode::LEAVE_SRC);
        dd decode('UTF-16-LE', "\xD8\xFA\xDD", sub{ sprintf "<U+%04X>", shift });
        __END__
        "\x{1F5FA}"
        "\x{3DDD}"
        "\x{3DDD}"
        "\x{1F5FA}"
        "\x{FAD8}"
        "\x{FAD8}"

        It could be argued that this is a bug / oversight in Encode, and of course if we know we're reading UTF-16 we should always read an even number of bytes. But still, because of this behavior, and because I don't yet know if there are encodings where chopped-up byte sequences might end up as valid characters, I'm doubtful that a generalized "read any file backwards with Encode" is reliable. Personally I'd just make a version for UTF-8 and UTF-16, and any others as needed; other encodings can be converted to the supported ones.
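
        For the UTF-8-only version, the chunk boundary can even be found without Encode, since continuation bytes are exactly those matching 0b10xxxxxx. A sketch (the helper name and code are mine, not from the thread):

        use strict;
        use warnings;

        # Hypothetical helper: offset of the first character boundary in
        # a chunk of UTF-8 bytes. At most 3 continuation bytes can
        # precede the next start byte in well-formed UTF-8.
        sub utf8_boundary_offset {
            my ($bytes) = @_;
            my $off = 0;
            $off++ while $off < length($bytes)
                     and (ord(substr $bytes, $off, 1) & 0xC0) == 0x80;
            return $off;   # bytes before $off belong to the previous chunk
        }

        print utf8_boundary_offset("\x84\xC3\x96"), "\n";   # prints 1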