in reply to Re^3: Processing an encoded file backwards
in thread Processing an encoded file backwards

> So it doesn't allow you to differentiate between a character that was broken by the read, and an actually malformed input file.

Providing a callback helps identify the malformed bytes:

      DB<131> dd $rr
    "\x84\xC3\x96\xC3\x9C.\r\n\r\n"
      DB<132> $start=0
      DB<133> $rru = Encode::decode('utf8',$rr, sub{ my $broken = shift; $start++; "" });
      DB<134> dd $rru
    "\xD6\xDC.\r\n\r\n"
      DB<135> p $start
    1
      DB<136>

> I don't use the debugger often, so reading its output doesn't come naturally to me ;-)

as commented

Furthermore, here are the debugger commands I used:

      DB<79> h p
    p expr          Same as "print {DB::OUT} expr" in current package.
      DB<80> h x
    x expr          Evals expression in list context, dumps the result.

update

One way to count how many malformed bytes are at the start, and to be sure the rest is well-formed:

      DB<159> $start=0
      DB<160> $rru = Encode::decode('utf8',$rr,sub{ $start++; return "" });
      DB<161> $sub= substr $rr,$start
      DB<162> $rru2 = Encode::decode('utf8',$sub ,Encode::FB_CROAK);
      DB<163> p $rru2 eq $rru
    1
      DB<164>
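The two-step check above could be wrapped in a small sub, roughly like this. This is only a sketch: the name `leading_malformed` is made up for illustration, it uses the strict 'UTF-8' encoding instead of the lax 'utf8' from the session, and it assumes the callback fires once per malformed byte, as it did for the sample data.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of Encode): returns (count, text) when every
# malformed byte sits at the start of $bytes, or an empty list otherwise.
sub leading_malformed {
    my ($bytes) = @_;
    my $bad = 0;

    # Lenient pass: the callback only counts; returning "" drops the byte.
    # Assumption: one callback invocation per malformed byte.
    my $lenient = decode('UTF-8', $bytes, sub { $bad++; '' });

    # Strict pass on the remainder: croaks if anything is still malformed.
    # LEAVE_SRC keeps decode() from modifying its input.
    my $rest   = substr $bytes, $bad;
    my $strict = eval {
        decode('UTF-8', $rest, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };

    # Both passes must agree, otherwise the damage was not only at the start.
    return unless defined $strict && $strict eq $lenient;
    return ($bad, $strict);
}

my ($count, $text) = leading_malformed("\x84\xC3\x96\xC3\x9C.\r\n\r\n");
print "skipped $count leading byte(s)\n" if defined $count;
```

An empty return list means the malformed bytes were not confined to the start of the chunk, so the input itself is probably damaged and not just cut in the middle of a character.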

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery
FootballPerl is like chess, only without the dice

Re^5: Processing an encoded file backwards
by haukex (Archbishop) on Jan 18, 2020 at 21:53 UTC
    > The Flag FB_QUIET seems to be the answer
    > providing a call back helps identifying the malformed bytes

    Unfortunately, that doesn't always seem to be the case:

    use warnings;
    use strict;
    use Encode qw/decode/;
    use Data::Dump;

    dd decode('UTF-16-BE', "\xD8\x3D\xDD\xFA", Encode::FB_QUIET|Encode::LEAVE_SRC);
    dd decode('UTF-16-BE', "\x3D\xDD\xFA", Encode::FB_QUIET|Encode::LEAVE_SRC);
    dd decode('UTF-16-BE', "\x3D\xDD\xFA", sub{ sprintf "<U+%04X>", shift });
    dd decode('UTF-16-LE', "\x3D\xD8\xFA\xDD", Encode::FB_QUIET|Encode::LEAVE_SRC);
    dd decode('UTF-16-LE', "\xD8\xFA\xDD", Encode::FB_QUIET|Encode::LEAVE_SRC);
    dd decode('UTF-16-LE', "\xD8\xFA\xDD", sub{ sprintf "<U+%04X>", shift });

    __END__
    "\x{1F5FA}"
    "\x{3DDD}"
    "\x{3DDD}"
    "\x{1F5FA}"
    "\x{FAD8}"
    "\x{FAD8}"

    It could be argued that this is a bug / oversight in Encode, and of course if we know we're reading UTF-16 we should always read an even number of bytes. But still, because of this behavior, and because I don't yet know if there are encodings where chopped up byte sequences might end up in valid characters, I'm doubtful that a generalized "read any file backwards with Encode" is reliable. Personally I'd just make a version for UTF-8 and UTF-16, and any others as needed, or other encodings can be converted to the supported ones.
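For the UTF-16 case, one way to avoid handing Encode a chopped-up chunk is to fix the start offset before decoding. The following is a rough sketch under my own assumptions, not something from Encode: `align_utf16be_start` is a hypothetical helper that rounds a byte offset down to an even position and backs up two more bytes when the chunk would otherwise begin on a low (trailing) surrogate. It only handles big-endian input; for UTF-16LE the `unpack` template would be 'v' instead of 'n'.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: move a byte offset back so that a UTF-16BE chunk
# starts on a code unit boundary and not in the middle of a surrogate pair.
sub align_utf16be_start {
    my ($bytes, $offset) = @_;

    # UTF-16 code units are two bytes wide, so the offset must be even.
    $offset-- if $offset % 2;

    if ($offset >= 2 && $offset + 2 <= length $bytes) {
        my $unit = unpack 'n', substr($bytes, $offset, 2);
        # 0xDC00..0xDFFF is a low surrogate: include the high surrogate
        # that precedes it.
        $offset -= 2 if $unit >= 0xDC00 && $unit <= 0xDFFF;
    }
    return $offset;
}

# "A" followed by U+1F5FA, the character from the example above.
my $bytes   = "\x00\x41\xD8\x3D\xDD\xFA";
my @aligned = map { align_utf16be_start($bytes, $_) } 3, 4, 5;

# Every offset inside the surrogate pair is moved back to its first byte.
my $chunk = substr $bytes, $aligned[0];
my $tail  = decode('UTF-16BE', $chunk, Encode::FB_CROAK | Encode::LEAVE_SRC);
printf "offsets: %s, decoded: U+%04X\n", "@aligned", ord $tail;
```

With the offset corrected first, the decoder never sees an odd number of bytes, so the silent misdecoding shown above does not come into play.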

      Haukex++

      NICE example for ambiguity of UTF-16 if you don't get the start right!!!

      For clarification: your point is that "\xFA" (first block) and "\xDD" (second block) should raise errors?

      use warnings;
      use strict;
      use Encode qw/decode/;
      use Data::Dump qw/dd/;

      dd decode('UTF-16-BE', "\x3D\xDD\xFA", Encode::FB_CROAK|Encode::LEAVE_SRC );
      dd decode('UTF-16-BE', "\xFA", Encode::FB_CROAK|Encode::LEAVE_SRC );
      dd decode('UTF-16-LE', "\xD8\xFA\xDD", Encode::FB_CROAK|Encode::LEAVE_SRC );
      dd decode('UTF-16-LE', "\xDD", Encode::FB_CROAK|Encode::LEAVE_SRC);

      __END__
      "\x{3DDD}"
      ""
      "\x{FAD8}"
      ""

      I agree, looks like a bug we should report.

      A really strange one too ...

      > Personally I'd just make a version for UTF-8 and UTF-16, and any others as needed, or other encodings can be converted to the supported ones.

      I don't even know of any wide encodings other than Unicode, so I prefer relying on Encode for UTF-8 and making sure UTF-16 is read in multiples of 2 bytes.

      update

      From the docs

      As of version 2.12, "Encode" supports coderef values for "CHECK"; see below. NOTE: Not all encodings support this feature. Some encodings ignore the *CHECK* argument. For example, Encode::Unicode ignores *CHECK* and it always croaks on error.

      ... but it doesn't croak

      Cheers Rolf

        What Perl version? I'm getting

        UTF-16BE:Partial character at /home/choroba/1.pl line 8.
        in 5.26.1.

        map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]