Re^3: Processing an encoded file backwards

Re^4: Processing an encoded file backwards (updated) by LanX (Saint) on Jan 18, 2020 at 21:16 UTC
> So it doesn't allow you to differentiate between a character that was broken by the read, and an actually malformed input file.

Providing a callback helps identify the malformed bytes:

    DB<131> dd $rr
    "\x84\xC3\x96\xC3\x9C.\r\n\r\n"
    DB<132> $start=0
    DB<133> $rru = Encode::decode('utf8',$rr, sub{ my $broken = shift; $start++; "" });
    DB<134> dd $rru
    "\xD6\xDC.\r\n\r\n"
    DB<135> p $start
    1
    DB<136>

> I don't use the debugger often, so reading its output doesn't come naturally to me ;-)

As commented, pp/dd are from Data::Dump and Dump is from Devel::Peek. Furthermore, the debugger commands are:

p prints a scalar
x prints a list

    DB<79> h p
    p expr      Same as "print {DB::OUT} expr" in current package.
    DB<80> h x
    x expr      Evals expression in list context, dumps the result.

Update: one way to identify how many malformed bytes are at the start, and to be sure the rest is well-formed:

    DB<159> $start=0
    DB<160> $rru = Encode::decode('utf8',$rr,sub{ $start++; return "" });
    DB<161> $sub= substr $rr,$start
    DB<162> $rru2 = Encode::decode('utf8',$sub ,Encode::FB_CROAK);
    DB<163> p $rru2 eq $rru
    1
    DB<164>

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery
FootballPerl is like chess, only without the dice
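The two debugger sessions above can be folded into one standalone script. This is only a sketch under a few assumptions: `decode_skipping_broken_start` is a made-up helper name (it is not part of Encode), the sample bytes are the ones from the session, and the approach only covers a chunk whose damage is at the front.

```perl
use strict;
use warnings;
use Encode ();

# Sample bytes from the session above: the leading \x84 is the tail of a
# two-byte UTF-8 sequence whose first byte was cut off by the read.
my $chunk = "\x84\xC3\x96\xC3\x9C.\r\n\r\n";

# Hypothetical helper (not part of Encode). Returns the decoded text and the
# number of leading bytes that were skipped; dies if malformed bytes also
# occur anywhere else in the chunk.
sub decode_skipping_broken_start {
    my ($bytes) = @_;

    # Pass 1: drop every malformed byte and count how many were dropped.
    # A copy is decoded so the original bytes stay available for pass 2.
    my $skipped = 0;
    my $copy    = $bytes;
    my $lenient = Encode::decode('utf8', $copy, sub { $skipped++; "" });

    # Pass 2: assume all dropped bytes were at the start and decode the rest
    # strictly. FB_CROAK dies on the first malformed byte it meets.
    my $rest   = substr $bytes, $skipped;
    my $strict = Encode::decode('utf8', $rest,
        Encode::FB_CROAK | Encode::LEAVE_SRC);

    die "malformed bytes are not confined to the start of the chunk\n"
        unless $strict eq $lenient;
    return ($strict, $skipped);
}

my ($text, $skipped) = decode_skipping_broken_start($chunk);
printf "skipped %d byte(s), decoded %d character(s)\n", $skipped, length $text;
```

If the strict pass dies, the chunk contains malformed bytes that cannot be explained by a cut at the front, which is the case the quoted question was worried about.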
Re^5: Processing an encoded file backwards by haukex (Archbishop) on Jan 18, 2020 at 21:53 UTC
> The Flag FB_QUIET seems to be the answer

> providing a call back helps identifying the malformed bytes

Unfortunately, that doesn't always seem to be the case:

    use warnings;
    use strict;
    use Encode qw/decode/;
    use Data::Dump;
    dd decode('UTF-16-BE', "\xD8\x3D\xDD\xFA", Encode::FB_QUIET|Encode::LEAVE_SRC);
    dd decode('UTF-16-BE', "\x3D\xDD\xFA", Encode::FB_QUIET|Encode::LEAVE_SRC);
    dd decode('UTF-16-BE', "\x3D\xDD\xFA", sub{ sprintf "<U+%04X>", shift });
    dd decode('UTF-16-LE', "\x3D\xD8\xFA\xDD", Encode::FB_QUIET|Encode::LEAVE_SRC);
    dd decode('UTF-16-LE', "\xD8\xFA\xDD", Encode::FB_QUIET|Encode::LEAVE_SRC);
    dd decode('UTF-16-LE', "\xD8\xFA\xDD", sub{ sprintf "<U+%04X>", shift });
    __END__
    "\x{1F5FA}"
    "\x{3DDD}"
    "\x{3DDD}"
    "\x{1F5FA}"
    "\x{FAD8}"
    "\x{FAD8}"

It could be argued that this is a bug / oversight in Encode, and of course if we know we're reading UTF-16 we should always read an even number of bytes. But still, because of this behavior, and because I don't yet know if there are encodings where chopped-up byte sequences might end up as valid characters, I'm doubtful that a generalized "read any file backwards with Encode" is reliable. Personally I'd just make a version for UTF-8 and UTF-16, and any others as needed, or other encodings can be converted to the supported ones.
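For the "one version per encoding" idea, the chunk boundary can be found by looking at the raw bytes instead of asking Encode. A minimal sketch, assuming hypothetical helper names (`utf8_skip`, `utf16be_skip`) and, for UTF-16, that the chunk already starts at an even file offset. Neither helper validates the data; each only reports how many bytes to skip to reach a possible character start.

```perl
use strict;
use warnings;

# Hypothetical helpers for aligning a chunk read from the middle of a file.
# Both work on raw byte strings and never call Encode.

# UTF-8: continuation bytes are 0x80-0xBF, and a well-formed sequence has at
# most three of them, so at most three leading bytes belong to a cut character.
sub utf8_skip {
    my ($bytes) = @_;
    my ($cont) = $bytes =~ /\A([\x80-\xBF]{0,3})/;
    return length $cont;
}

# UTF-16BE: if the first 16-bit unit is a low surrogate (0xDC00-0xDFFF), its
# high surrogate was left behind in the previous chunk, so skip two bytes.
sub utf16be_skip {
    my ($bytes) = @_;
    return 0 if length($bytes) < 2;
    my $unit = unpack 'n', $bytes;    # first code unit, big-endian
    return ($unit >= 0xDC00 && $unit <= 0xDFFF) ? 2 : 0;
}

print utf8_skip("\x84\xC3\x96\xC3\x9C."), "\n";     # starts inside a character
print utf16be_skip("\xDD\xFA\x00\x41"), "\n";       # starts on a low surrogate
print utf16be_skip("\xD8\x3D\xDD\xFA"), "\n";       # starts on a high surrogate
```

The skipped bytes would then be prepended to the next (earlier) chunk when reading backwards, so that no character is lost.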
Re^6: Processing an encoded file backwards (updated) by LanX (Saint) on Jan 18, 2020 at 22:30 UTC
Haukex++ NICE example for the ambiguity of UTF-16 if you don't get the start right!!!

For clarification: your point is that `"\xFA"` (first block) and `"\xDD"` (second block) should raise errors?

    use warnings;
    use strict;
    use Encode qw/decode/;
    use Data::Dump qw/dd/;
    dd decode('UTF-16-BE', "\x3D\xDD\xFA", Encode::FB_CROAK|Encode::LEAVE_SRC );
    dd decode('UTF-16-BE', "\xFA", Encode::FB_CROAK|Encode::LEAVE_SRC );
    dd decode('UTF-16-LE', "\xD8\xFA\xDD", Encode::FB_CROAK|Encode::LEAVE_SRC );
    dd decode('UTF-16-LE', "\xDD", Encode::FB_CROAK|Encode::LEAVE_SRC);
    __END__
    "\x{3DDD}"
    ""
    "\x{FAD8}"
    ""

I agree, looks like a bug we should report. A really strange one too ...

> Personally I'd just make a version for UTF-8 and UTF-16, and any others as needed, or other encodings can be converted to the supported ones.

I don't even know any wide encodings other than Unicode, so I prefer relying on Encode for UTF-8 and making sure UTF-16 is read in multiples of ~~4 bytes~~ 2 bytes.

Update: from the docs:

    As of version 2.12, "Encode" supports coderef values for "CHECK"; see below.
    NOTE: Not all encodings support this feature. Some encodings ignore the
    CHECK argument. For example, Encode::Unicode ignores CHECK and it always
    croaks on error.

... but it doesn't croak.

Cheers Rolf
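Since the examples above show Encode silently dropping a dangling byte instead of croaking, the "multiples of 2 bytes" rule has to be enforced by the caller. A small sketch of such a guard, assuming a hypothetical wrapper name (`decode_utf16be_whole_units`, not part of Encode). It only rejects odd-length input and does not detect a chunk that begins on a low surrogate.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical wrapper (not part of Encode): refuse any byte string that does
# not consist of whole 16-bit code units before handing it to Encode.
sub decode_utf16be_whole_units {
    my ($bytes) = @_;
    die sprintf("UTF-16 chunk has odd length (%d bytes)\n", length $bytes)
        if length($bytes) % 2;
    return Encode::decode('UTF-16BE', $bytes,
        Encode::FB_CROAK | Encode::LEAVE_SRC);
}

# The well-formed surrogate pair from the examples above decodes normally.
my $ok = decode_utf16be_whole_units("\xD8\x3D\xDD\xFA");
printf "U+%04X\n", ord $ok;

# The three-byte chunk is rejected before Encode ever sees it.
my $bad = eval { decode_utf16be_whole_units("\x3D\xDD\xFA") };
print "rejected: $@" unless defined $bad;
```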
Re^7: Processing an encoded file backwards (updated) by choroba (Cardinal) on Jan 18, 2020 at 23:13 UTC
Re^8: Processing an encoded file backwards (updated) by LanX (Saint) on Jan 18, 2020 at 23:27 UTC
Re^8: Processing an encoded file backwards (updated) by LanX (Saint) on Jan 18, 2020 at 23:19 UTC