
> As you can see in my demo in the other answer, Encode uses "\x{FFFD}" to decode the broken character. If it does so reliably, this could lead to better code.

Well, to be purist about it (emphasis mine):

If CHECK is 0, encoding and decoding replace any malformed character with a substitution character.

So it doesn't allow you to differentiate between a character that was broken by the read, and an actually malformed input file.
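To make that concrete, here is a minimal sketch. The byte strings are my own made-up examples, not the test data from this thread. One string simulates a chunk whose read started in the middle of a character, the other has a stray byte in the data itself. With CHECK left at 0, both come back with U+FFFD and nothing tells the two cases apart.

```perl
use strict;
use warnings;
use Encode qw/decode/;

# "\xC3\x96" is the UTF-8 encoding of U+00D6. Keeping only its second byte
# simulates a chunk whose read started in the middle of that character.
my $cut_by_read = "\x96abc";

# Here the same stray continuation byte sits in the middle of the data,
# i.e. the input itself is malformed.
my $bad_input = "abc\x96def";

# CHECK is omitted, so it defaults to 0 and malformed bytes are substituted.
my $from_cut = decode('UTF-8', $cut_by_read);
my $from_bad = decode('UTF-8', $bad_input);

# Print the code points of both results; each contains one U+FFFD.
for my $text ($from_cut, $from_bad) {
    print join(' ', map { sprintf 'U+%04X', ord } split //, $text), "\n";
}
```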

Update:

> Not sure what other multi-byte encodings are out there...

Me neither, but I think UTF-8 and UTF-16 would already cover a lot of what's out there today.

> As you can see in my demo in the other answer

I don't use the debugger often, so reading its output doesn't come naturally to me ;-)

Re^4: Processing an encoded file backwards (updated)
by LanX (Saint) on Jan 18, 2020 at 21:16 UTC
    > So it doesn't allow you to differentiate between a character that was broken by the read, and an actually malformed input file.

    Providing a callback helps identify the malformed bytes:

    DB<131> dd $rr
    "\x84\xC3\x96\xC3\x9C.\r\n\r\n"
    DB<132> $start=0
    DB<133> $rru = Encode::decode('utf8',$rr, sub{ my $broken = shift; $start++; "" });
    DB<134> dd $rru
    "\xD6\xDC.\r\n\r\n"
    DB<135> p $start
    1
    DB<136>

    > I don't use the debugger often, so reading its output doesn't come naturally to me ;-)

    as commented

    Furthermore, the debugger commands used here:

    • p prints scalar
    • x prints list

    DB<79> h p
    p expr          Same as "print {DB::OUT} expr" in current package.
    DB<80> h x
    x expr          Evals expression in list context, dumps the result.

    update

    One way to identify how many malformed bytes are at the start, and to be sure the rest is well-formed:

    DB<159> $start=0
    DB<160> $rru = Encode::decode('utf8',$rr,sub{ $start++; return "" });
    DB<161> $sub= substr $rr,$start
    DB<162> $rru2 = Encode::decode('utf8',$sub ,Encode::FB_CROAK);
    DB<163> p $rru2 eq $rru
    1
    DB<164>
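The same two-pass idea can be written as a plain script rather than a debugger session. This is only a sketch: `decode_skipping_prefix` is a hypothetical helper name, not something Encode provides, and it assumes the `utf8` decoder calls the callback once per malformed byte, as the session above suggests. The strict second pass runs on a copy with `LEAVE_SRC`, so the input is not modified.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper, not part of Encode.
# Returns (number of malformed bytes, decoded text) if all malformed bytes
# were at the start of the chunk; returns an empty list otherwise.
sub decode_skipping_prefix {
    my ($octets) = @_;
    my $bad = 0;

    # Pass 1: drop every malformed byte and count how many there were.
    my $lenient = Encode::decode('utf8', $octets, sub { $bad++; "" });

    # Pass 2: strictly decode everything after the first $bad bytes.
    # eval catches the croak if something malformed is still in there.
    my $rest   = substr $octets, $bad;
    my $strict = eval {
        Encode::decode('utf8', $rest, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };

    # Only trust the count if both passes agree on the text.
    return unless defined $strict && $strict eq $lenient;
    return ($bad, $strict);
}

# Same bytes as in the debugger session above.
my ($skipped, $text) = decode_skipping_prefix("\x84\xC3\x96\xC3\x9C.\r\n\r\n");
print "skipped $skipped byte(s)\n" if defined $skipped;
```

If a malformed byte sits in the middle of the chunk, the count no longer matches the prefix length, the strict pass fails or disagrees, and the helper returns nothing instead of a misleading offset.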

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery
    FootballPerl is like chess, only without the dice

      > The Flag FB_QUIET seems to be the answer
      > providing a call back helps identifying the malformed bytes

      Unfortunately, that doesn't always seem to be the case:

      use warnings;
      use strict;
      use Encode qw/decode/;
      use Data::Dump;
      dd decode('UTF-16-BE', "\xD8\x3D\xDD\xFA", Encode::FB_QUIET|Encode::LEAVE_SRC);
      dd decode('UTF-16-BE', "\x3D\xDD\xFA", Encode::FB_QUIET|Encode::LEAVE_SRC);
      dd decode('UTF-16-BE', "\x3D\xDD\xFA", sub{ sprintf "<U+%04X>", shift });
      dd decode('UTF-16-LE', "\x3D\xD8\xFA\xDD", Encode::FB_QUIET|Encode::LEAVE_SRC);
      dd decode('UTF-16-LE', "\xD8\xFA\xDD", Encode::FB_QUIET|Encode::LEAVE_SRC);
      dd decode('UTF-16-LE', "\xD8\xFA\xDD", sub{ sprintf "<U+%04X>", shift });
      __END__
      "\x{1F5FA}"
      "\x{3DDD}"
      "\x{3DDD}"
      "\x{1F5FA}"
      "\x{FAD8}"
      "\x{FAD8}"

      It could be argued that this is a bug / oversight in Encode, and of course if we know we're reading UTF-16 we should always read an even number of bytes. But still, because of this behavior, and because I don't yet know if there are encodings where chopped up byte sequences might end up in valid characters, I'm doubtful that a generalized "read any file backwards with Encode" is reliable. Personally I'd just make a version for UTF-8 and UTF-16, and any others as needed, or other encodings can be converted to the supported ones.
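For the UTF-16 case, a rough sketch of what "always read an even number of bytes" could look like when picking the start of a chunk. `align_utf16be_start` is a hypothetical helper of mine, and it assumes big-endian data that is well-formed and begins at offset 0 of the buffer. It moves the offset back to a code unit boundary, and back one more unit if it lands on a low surrogate (0xDC00 to 0xDFFF), so a surrogate pair is never split.

```perl
use strict;
use warnings;
use Encode qw/decode/;

# Hypothetical helper, assuming well-formed UTF-16BE that starts at offset 0.
sub align_utf16be_start {
    my ($octets, $pos) = @_;

    # Code units are 2 bytes wide, so an odd offset is in the middle of one.
    $pos-- if $pos % 2;

    if ($pos >= 2 && $pos + 2 <= length $octets) {
        # 'n' unpacks an unsigned 16-bit big-endian value.
        my $unit = unpack 'n', substr($octets, $pos, 2);
        # A low surrogate belongs to the unit before it: include that one too.
        $pos -= 2 if $unit >= 0xDC00 && $unit <= 0xDFFF;
    }
    return $pos;
}

# "A" followed by U+1F5FA (the surrogate pair D83D DDFA from the example above).
my $octets = "\x00\x41\xD8\x3D\xDD\xFA";

# Offset 5 is inside the low surrogate; it gets moved back to offset 2.
my $start = align_utf16be_start($octets, 5);

# Decode a copy, so FB_CROAK cannot touch the original buffer.
my $chunk = substr $octets, $start;
my $text  = decode('UTF-16BE', $chunk, Encode::FB_CROAK | Encode::LEAVE_SRC);
printf "start=%d, decoded U+%04X\n", $start, ord $text;
```

For UTF-16LE the same idea should apply with `unpack 'v'` instead of `'n'`. This does not help with input that is already misaligned or malformed, which is the case the examples above show Encode accepting silently.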

        Haukex++

        NICE example for ambiguity of UTF-16 if you don't get the start right!!!

        For clarification: your point is that "\xFA" (first block) and "\xDD" (second block) should raise errors?

        use warnings;
        use strict;
        use Encode qw/decode/;
        use Data::Dump qw/dd/;
        dd decode('UTF-16-BE', "\x3D\xDD\xFA", Encode::FB_CROAK|Encode::LEAVE_SRC );
        dd decode('UTF-16-BE', "\xFA", Encode::FB_CROAK|Encode::LEAVE_SRC );
        dd decode('UTF-16-LE', "\xD8\xFA\xDD", Encode::FB_CROAK|Encode::LEAVE_SRC );
        dd decode('UTF-16-LE', "\xDD", Encode::FB_CROAK|Encode::LEAVE_SRC);
        __END__
        "\x{3DDD}"
        ""
        "\x{FAD8}"
        ""

        I agree, looks like a bug we should report.

        A really strange one too ...

        > Personally I'd just make a version for UTF-8 and UTF-16, and any others as needed, or other encodings can be converted to the supported ones.

        I don't even know of other wide encodings besides Unicode, so I prefer relying on Encode for UTF-8 and making sure UTF-16 is read in multiples of 2 bytes.

        update

        From the docs

        As of version 2.12, "Encode" supports coderef values for "CHECK"; see below.
        NOTE: Not all encodings support this feature. Some encodings ignore the *CHECK* argument. For example, Encode::Unicode ignores *CHECK* and it always croaks on error.

        ... but it doesn't croak

        Cheers Rolf