Re^4: Processing an encoded file backwards (updated)

> So it doesn't allow you to differentiate between a character that was broken by the read, and an actually malformed input file.

Providing a callback helps identify the malformed bytes:

  DB<131> dd $rr
"\x84\xC3\x96\xC3\x9C.\r\n\r\n"

  DB<132> $start=0

  DB<133> $rru = Encode::decode('utf8',$rr, sub{ my $broken = shift; $start++; "" });

  DB<134> dd $rru
"\xD6\xDC.\r\n\r\n"

  DB<135> p $start
1
  DB<136>
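Outside the debugger, the session above might look roughly like the following standalone script. This is only a sketch: the variable names `$bytes`, `$malformed` and `$decoded` are made up for illustration, and it adds `scalar(@_)` instead of 1 on the assumption that some Encode versions may pass more than one byte ordinal to the callback in a single call.

```perl
use strict;
use warnings;
use Encode ();

# Sample bytes from the session above: a stray continuation byte (0x84)
# followed by valid UTF-8 for "ÖÜ." and two CRLF pairs.
my $bytes = "\x84\xC3\x96\xC3\x9C.\r\n\r\n";

my $malformed = 0;    # number of bytes the decoder rejected
my $decoded = Encode::decode('utf8', $bytes, sub {
    $malformed += @_;    # count the ordinals passed in, not the calls (assumption, see above)
    return "";           # drop the rejected bytes from the output
});

printf "malformed bytes: %d\n", $malformed;
printf "decoded length:  %d\n", length $decoded;
```

Because CHECK is a coderef here, `$bytes` itself is left untouched and can be inspected again afterwards.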

> I don't use the debugger often, so reading its output doesn't come naturally to me ;-)

As commented:

pp/dd are from Data::Dump
Dump is from Devel::Peek

Furthermore, these are debugger commands:

p prints an expression
x evaluates an expression in list context and dumps the result

 DB<79> h p
p expr        Same as "print {DB::OUT} expr" in current package.

  DB<80> h x
x expr        Evals expression in list context, dumps the result.

update

One way to count how many malformed bytes are at the start, and to be sure the rest is well-formed:

  DB<159> $start=0

  DB<160> $rru = Encode::decode('utf8',$rr,sub{ $start++; return "" });

  DB<161> $sub= substr $rr,$start

  DB<162> $rru2 = Encode::decode('utf8',$sub ,Encode::FB_CROAK);

  DB<163> p $rru2 eq $rru
1
  DB<164>
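The same check could be wrapped in a small subroutine. A minimal sketch, assuming the helper name `decode_skipping_prefix` (hypothetical, not part of Encode) and using `LEAVE_SRC` so that the strict pass does not consume its input:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns (number of skipped leading bytes, decoded text).
# Dies if anything after the skipped prefix is malformed.
sub decode_skipping_prefix {
    my ($bytes) = @_;

    # Pass 1: lenient decode, counting every byte the decoder rejects.
    my $skipped = 0;
    my $lenient = Encode::decode('utf8', $bytes, sub { $skipped += @_; "" });

    # Pass 2: strict decode of everything after the presumed broken prefix.
    # FB_CROAK dies if the rejected bytes were not all at the start.
    my $rest   = substr $bytes, $skipped;
    my $strict = Encode::decode('utf8', $rest, Encode::FB_CROAK | Encode::LEAVE_SRC);

    die "malformed bytes beyond the first $skipped\n" unless $strict eq $lenient;
    return ($skipped, $strict);
}

my ($skipped, $text) = decode_skipping_prefix("\x84\xC3\x96\xC3\x9C.\r\n\r\n");
printf "skipped %d byte(s), got %d character(s)\n", $skipped, length $text;
```

If a rejected byte sits in the middle of the buffer rather than at the start, the strict pass is expected to die, which is the signal that the input itself is malformed and not merely cut at a bad position.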

Cheers Rolf
_{(addicted to the Perl Programming Language :)

Wikisyntax for the Monastery
FootballPerl is like chess, only without the dice}


Re^5: Processing an encoded file backwards by haukex (Archbishop) on Jan 18, 2020 at 21:53 UTC
> The Flag FB_QUIET seems to be the answer

> providing a call back helps identifying the malformed bytes

Unfortunately, that doesn't always seem to be the case:

  use warnings;
  use strict;
  use Encode qw/decode/;
  use Data::Dump;
  dd decode('UTF-16-BE', "\xD8\x3D\xDD\xFA", Encode::FB_QUIET|Encode::LEAVE_SRC);
  dd decode('UTF-16-BE', "\x3D\xDD\xFA", Encode::FB_QUIET|Encode::LEAVE_SRC);
  dd decode('UTF-16-BE', "\x3D\xDD\xFA", sub{ sprintf "<U+%04X>", shift });
  dd decode('UTF-16-LE', "\x3D\xD8\xFA\xDD", Encode::FB_QUIET|Encode::LEAVE_SRC);
  dd decode('UTF-16-LE', "\xD8\xFA\xDD", Encode::FB_QUIET|Encode::LEAVE_SRC);
  dd decode('UTF-16-LE', "\xD8\xFA\xDD", sub{ sprintf "<U+%04X>", shift });
  __END__
  "\x{1F5FA}"
  "\x{3DDD}"
  "\x{3DDD}"
  "\x{1F5FA}"
  "\x{FAD8}"
  "\x{FAD8}"

It could be argued that this is a bug / oversight in Encode, and of course if we know we're reading UTF-16 we should always read an even number of bytes. But still, because of this behavior, and because I don't yet know if there are encodings where chopped up byte sequences might end up in valid characters, I'm doubtful that a generalized "read any file backwards with Encode" is reliable. Personally I'd just make a version for UTF-8 and UTF-16, and any others as needed, or other encodings can be converted to the supported ones.
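To illustrate the "always read an even number of bytes" point without relying on Encode's error handling, a chunk boundary can be aligned by looking at the raw code units. This is only a sketch: `align_utf16be_start` is a hypothetical helper, it handles UTF-16BE only, and it assumes offsets are counted from a position that is itself on a code-unit boundary.

```perl
use strict;
use warnings;

# Hypothetical helper: given a UTF-16BE byte buffer and a proposed start
# offset, move the offset back so that it falls on a 2-byte boundary and
# does not split a surrogate pair.
sub align_utf16be_start {
    my ($buf, $pos) = @_;

    $pos-- if $pos % 2;    # odd offset: step back to a code-unit boundary

    if ($pos >= 2 && $pos + 2 <= length($buf)) {
        my $unit = unpack 'n', substr($buf, $pos, 2);    # big-endian 16-bit unit
        # A low surrogate belongs to the code unit before it, so include that too.
        $pos -= 2 if $unit >= 0xDC00 && $unit <= 0xDFFF;
    }
    return $pos;
}

# "A" followed by U+1F5FA, the character from the example above.
my $buf = "\x00\x41\xD8\x3D\xDD\xFA";
printf "proposed %d -> aligned %d\n", $_, align_utf16be_start($buf, $_) for 1 .. 5;
```

For UTF-16LE the same idea would apply with `unpack 'v'` in place of `unpack 'n'`.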
Re^6: Processing an encoded file backwards (updated) by LanX (Saint) on Jan 18, 2020 at 22:30 UTC
Haukex++ NICE example for ambiguity of UTF-16 if you don't get the start right!!!

For clarification: your point is that `"\xFA"` (first block) and `"\xDD"` (second block) should raise errors?

  use warnings;
  use strict;
  use Encode qw/decode/;
  use Data::Dump qw/dd/;
  dd decode('UTF-16-BE', "\x3D\xDD\xFA", Encode::FB_CROAK|Encode::LEAVE_SRC );
  dd decode('UTF-16-BE', "\xFA", Encode::FB_CROAK|Encode::LEAVE_SRC );
  dd decode('UTF-16-LE', "\xD8\xFA\xDD", Encode::FB_CROAK|Encode::LEAVE_SRC );
  dd decode('UTF-16-LE', "\xDD", Encode::FB_CROAK|Encode::LEAVE_SRC);
  __END__
  "\x{3DDD}"
  ""
  "\x{FAD8}"
  ""

I agree, looks like a bug we should report. A really strange one too ...

> Personally I'd just make a version for UTF-8 and UTF-16, and any others as needed, or other encodings can be converted to the supported ones.

I don't even know other wide encodings except unicode, so I prefer relying on Encode for utf8 and make sure utf16 are read modulo ~~4 bytes~~ 2 bytes.

update

From the docs

  As of version 2.12, "Encode" supports coderef values for "CHECK"; see below.

  NOTE: Not all encodings support this feature. Some encodings ignore the CHECK argument. For example, Encode::Unicode ignores CHECK and it always croaks on error.

... but it doesn't croak

Cheers Rolf
Re^7: Processing an encoded file backwards (updated) by choroba (Cardinal) on Jan 18, 2020 at 23:13 UTC
What Perl version? I'm getting

  UTF-16BE:Partial character at /home/choroba/1.pl line 8.

in 5.26.1.

`map{substr$_->[0],$_->[1]||0,1}[\||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^ARGV,3]`
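Since the results evidently differ between installations, a small probe that reports the Perl and Encode versions next to each outcome may help when comparing notes. A sketch only: `probe` is a hypothetical helper, and no particular output is claimed here because that is exactly what varies.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode strictly and report whether Encode croaked.
sub probe {
    my ($encoding, $bytes) = @_;
    my $decoded;
    my $ok = eval {
        $decoded = Encode::decode($encoding, $bytes,
                                  Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    if (!$ok) {
        (my $msg = $@) =~ s/\s+\z//;    # trim the trailing newline
        return "croaked: $msg";
    }
    # List the decoded code points so the result is readable on any terminal.
    return "returned: " . join ' ', map { sprintf('U+%04X', ord $_) } split //, $decoded;
}

printf "Perl %s, Encode %s\n", $], $Encode::VERSION;
for my $case (['UTF-16BE', "\x3D\xDD\xFA"], ['UTF-16BE', "\xFA"],
              ['UTF-16LE', "\xD8\xFA\xDD"], ['UTF-16LE', "\xDD"]) {
    my ($encoding, $bytes) = @$case;
    printf "%-8s %-8s %s\n", $encoding, unpack('H*', $bytes), probe($encoding, $bytes);
}
```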
Re^8: Processing an encoded file backwards (updated) by LanX (Saint) on Jan 18, 2020 at 23:27 UTC
Re^8: Processing an encoded file backwards (updated) by LanX (Saint) on Jan 18, 2020 at 23:19 UTC
Re^9: Processing an encoded file backwards (updated) by choroba (Cardinal) on Jan 18, 2020 at 23:46 UTC
Re^9: Processing an encoded file backwards by haukex (Archbishop) on Jan 19, 2020 at 12:05 UTC