in reply to Re^4: Processing an encoded file backwards (updated)
in thread Processing an encoded file backwards

> The Flag FB_QUIET seems to be the answer
> providing a call back helps identifying the malformed bytes

Unfortunately, that doesn't always seem to be the case:

use warnings;
use strict;
use Encode qw/decode/;
use Data::Dump;

dd decode('UTF-16-BE', "\xD8\x3D\xDD\xFA", Encode::FB_QUIET|Encode::LEAVE_SRC);
dd decode('UTF-16-BE', "\x3D\xDD\xFA", Encode::FB_QUIET|Encode::LEAVE_SRC);
dd decode('UTF-16-BE', "\x3D\xDD\xFA", sub{ sprintf "<U+%04X>", shift });
dd decode('UTF-16-LE', "\x3D\xD8\xFA\xDD", Encode::FB_QUIET|Encode::LEAVE_SRC);
dd decode('UTF-16-LE', "\xD8\xFA\xDD", Encode::FB_QUIET|Encode::LEAVE_SRC);
dd decode('UTF-16-LE', "\xD8\xFA\xDD", sub{ sprintf "<U+%04X>", shift });

__END__
"\x{1F5FA}"
"\x{3DDD}"
"\x{3DDD}"
"\x{1F5FA}"
"\x{FAD8}"
"\x{FAD8}"

It could be argued that this is a bug / oversight in Encode, and of course if we know we're reading UTF-16 we should always read an even number of bytes. But still, because of this behavior, and because I don't yet know if there are encodings where chopped up byte sequences might end up in valid characters, I'm doubtful that a generalized "read any file backwards with Encode" is reliable. Personally I'd just make a version for UTF-8 and UTF-16, and any others as needed, or other encodings can be converted to the supported ones.
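For what it's worth, here is a minimal sketch of the "always read an even number of bytes" guard. The helper name decode_utf16_strict is made up for illustration and is not part of Encode; it simply refuses odd-length input before Encode ever sees it, so the result no longer depends on how a given Encode version treats a dangling byte.

```perl
use warnings;
use strict;
use Encode qw/decode/;

# Hypothetical helper (not part of Encode): reject odd-length input up
# front instead of relying on Encode to notice the trailing byte.
sub decode_utf16_strict {
    my ($encoding, $bytes) = @_;
    die sprintf("%s: odd number of bytes (%d)\n", $encoding, length $bytes)
        if length($bytes) % 2;
    # FB_CROAK so that malformed but even-length input (e.g. a lone
    # surrogate) is still reported; LEAVE_SRC keeps $bytes untouched.
    return decode($encoding, $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
}
```

With this, the four-byte surrogate pair from the example still decodes to one character, while the three-byte inputs die before any decoding is attempted.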

Re^6: Processing an encoded file backwards (updated)
by LanX (Saint) on Jan 18, 2020 at 22:30 UTC
    Haukex++

    NICE example for ambiguity of UTF-16 if you don't get the start right!!!

    For clarification: your point is that "\xFA" (first block) and "\xDD" (second block) should raise errors?

    use warnings;
    use strict;
    use Encode qw/decode/;
    use Data::Dump qw/dd/;

    dd decode('UTF-16-BE', "\x3D\xDD\xFA", Encode::FB_CROAK|Encode::LEAVE_SRC );
    dd decode('UTF-16-BE', "\xFA", Encode::FB_CROAK|Encode::LEAVE_SRC );
    dd decode('UTF-16-LE', "\xD8\xFA\xDD", Encode::FB_CROAK|Encode::LEAVE_SRC );
    dd decode('UTF-16-LE', "\xDD", Encode::FB_CROAK|Encode::LEAVE_SRC);

    __END__
    "\x{3DDD}"
    ""
    "\x{FAD8}"
    ""

    I agree, looks like a bug we should report.

    A really strange one too ...

    > Personally I'd just make a version for UTF-8 and UTF-16, and any others as needed, or other encodings can be converted to the supported ones.

    I don't even know of other wide encodings except Unicode, so I'd prefer relying on Encode for UTF-8 and making sure UTF-16 is read in multiples of 2 bytes.
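A rough sketch of what reading UTF-16 in 2-byte steps could look like when walking backwards. This is only an illustration for big-endian input already held in memory; the sub name utf16be_chars_backwards is invented here. It steps back one 16-bit unit at a time and widens the step to 4 bytes when it lands on the low half of a surrogate pair.

```perl
use warnings;
use strict;
use Encode qw/decode/;

# Hypothetical helper: return the characters of a UTF-16BE byte string,
# last character first, stepping backwards in aligned 2-byte units.
sub utf16be_chars_backwards {
    my ($bytes) = @_;
    die "odd number of bytes\n" if length($bytes) % 2;
    my @chars;
    my $pos = length $bytes;
    while ($pos > 0) {
        my $len  = 2;
        my $unit = unpack 'n', substr($bytes, $pos - 2, 2);
        # A low surrogate (DC00..DFFF) preceded by a high surrogate
        # (D800..DBFF) means the character occupies 4 bytes.
        if ($unit >= 0xDC00 && $unit <= 0xDFFF && $pos >= 4) {
            my $prev = unpack 'n', substr($bytes, $pos - 4, 2);
            $len = 4 if $prev >= 0xD800 && $prev <= 0xDBFF;
        }
        $pos -= $len;
        # Copy the chunk so decode() never sees an lvalue into $bytes.
        my $chunk = substr($bytes, $pos, $len);
        # An unpaired surrogate is left for FB_CROAK to complain about.
        push @chars, decode('UTF-16BE', $chunk,
                            Encode::FB_CROAK | Encode::LEAVE_SRC);
    }
    return @chars;
}
```

For UTF-16LE the same idea should work with unpack 'v' and the 'UTF-16LE' encoding name. A real backwards file reader would additionally have to keep its read buffer boundaries on even offsets.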

    update

    From the docs

    As of version 2.12, "Encode" supports coderef values for "CHECK"; see below.

    NOTE: Not all encodings support this feature. Some encodings ignore the *CHECK* argument. For example, Encode::Unicode ignores *CHECK* and it always croaks on error.

    ... but it doesn't croak
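To compare notes across machines, a small probe along these lines might help. It is only a sketch; the sub name probe_decode is made up. It wraps decode in eval, so it reports whether the call died or what it returned, and it prints the perl and Encode versions alongside. No expected output is given because that is exactly what seems to vary.

```perl
use warnings;
use strict;
use Encode qw/decode/;

# Hypothetical helper: run decode() with FB_CROAK and report whether it
# died, instead of letting the exception end the script.
sub probe_decode {
    my ($encoding, $bytes) = @_;
    my $result;
    my $ok = eval {
        $result = decode($encoding, $bytes,
                         Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? { died => 0, result => $result }
               : { died => 1, error  => "$@" };
}

printf "perl %vd, Encode %s\n", $^V, Encode->VERSION;
for my $case (['UTF-16BE', "\x3D\xDD\xFA"], ['UTF-16BE', "\xFA"],
              ['UTF-16LE', "\xD8\xFA\xDD"], ['UTF-16LE', "\xDD"]) {
    my ($encoding, $bytes) = @$case;
    my $r = probe_decode($encoding, $bytes);
    my $outcome = $r->{died}
        ? "died: $r->{error}"
        # Show code points rather than raw wide characters.
        : "returned [" . join(' ', map { sprintf 'U+%04X', ord }
                                   split //, $r->{result}) . "]\n";
    printf "%-8s %-6s %s", $encoding, unpack('H*', $bytes), $outcome;
}
```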

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

      What Perl version? I'm getting

      UTF-16BE:Partial character at /home/choroba/1.pl line 8.
      in 5.26.1.

      map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
        That's what I get in git-bash with 5.26.2

        probably a windows problem?

        $ perl
        use warnings;
        use strict;
        use Encode qw/decode/;
        use Data::Dumper qw/Dumper/;

        warn Dumper decode('UTF-16-BE', "\x3D\xDD\xFA", Encode::FB_CROAK|Encode::LEAVE_SRC );
        warn Dumper decode('UTF-16-BE', "\xFA", Encode::FB_CROAK|Encode::LEAVE_SRC );
        warn Dumper decode('UTF-16-LE', "\xD8\xFA\xDD", Encode::FB_CROAK|Encode::LEAVE_SRC );
        warn Dumper decode('UTF-16-LE', "\xDD", Encode::FB_CROAK|Encode::LEAVE_SRC);
        __END__
        $VAR1 = "\x{3ddd}";
        $VAR1 = '';
        $VAR1 = "\x{fad8}";
        $VAR1 = '';

        MINGW64 ~
        $ perl -v
        This is perl 5, version 26, subversion 2 (v5.26.2) built for x86_64-msys-thread-multi

        Cheers Rolf

        > What Perl version?

        'This is perl 5, version 24, subversion 1 (v5.24.1) built for MSWin32-x64-multi-thread'

        does it die?

        Cheers Rolf