in reply to Re: Processing an encoded file backwards
in thread Processing an encoded file backwards

Sure, this is the basic approach for UTF-8.

I was hoping for a more elegant and generic solution using Encode.

As you can see in my demo in the other answer, Encode uses "\x{FFFD}" when decoding the broken character.

If it's reliable° in doing so, this could lead to better code.

Not sure what other multi-byte encodings are out there...

Cheers Rolf
(addicted to the Perl Programming Language :)

°) It is; from the Encode documentation: "If CHECK is 0, encoding and decoding replace any malformed character with a substitution character. When you encode, SUBCHAR is used. When you decode, the Unicode REPLACEMENT CHARACTER, code point U+FFFD, is used. If the data is supposed to be UTF-8, an optional lexical warning of warning category "utf8" is given."
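A quick demonstration of that paragraph (my own two-liner, not from the post):

  use strict;
  use warnings;           # enables the lexical "utf8" warning category
  use Encode qw/decode/;

  # A lone \x96 is malformed UTF-8: with the default CHECK of 0 it is
  # replaced by U+FFFD, and a warning in category "utf8" is emitted.
  my $chars = decode('UTF-8', "\x96");
  printf "U+%04X\n", ord $chars;   # prints U+FFFD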

Update:

The flag FB_QUIET seems to be the answer.
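A minimal sketch of what that could look like (the example buffer and the skip-and-retry loop are my own illustration, assuming the only damage is at the start of the chunk): with FB_QUIET, decode() returns the decodable prefix and leaves the unprocessed bytes in the source variable, so a chunk that starts mid-character can be trimmed until the remainder decodes cleanly.

  use strict;
  use warnings;
  use Encode ();

  # Chunk that starts mid-character: \x84 is the tail of a multi-byte
  # sequence whose lead byte ended up in the previous (earlier) chunk.
  my $bytes = "\x84\xC3\x96\xC3\x9C";

  my $skipped = 0;
  my $chars   = '';
  while (length $bytes) {
      my $try = $bytes;
      # FB_QUIET: return what decoded so far, leave the rest in $try.
      $chars = Encode::decode('UTF-8', $try, Encode::FB_QUIET);
      last if $try eq '';            # whole remainder decoded cleanly
      substr($bytes, 0, 1) = '';     # drop one leading byte and retry
      $skipped++;
  }
  # Now $chars is "\x{D6}\x{DC}" ("ÖÜ") and $skipped is 1.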


Re^3: Processing an encoded file backwards
by haukex (Archbishop) on Jan 18, 2020 at 20:56 UTC
    As you can see in my demo in the other answer, Encode uses "\x{FFFD}" when decoding the broken character. If it's reliable° in doing so, this could lead to better code.

    Well, to be a purist about it (emphasis mine):

    If CHECK is 0, encoding and decoding replace *any malformed character* with a substitution character.

    So it doesn't allow you to differentiate between a character that was broken by the read, and an actually malformed input file.
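
    To illustrate (this comparison is my own, not from the post): a character chopped in half by the read and a byte that can never occur in valid UTF-8 both come back as the same U+FFFD, so nothing in the decoded output tells the two cases apart.

    use strict;
    use warnings;
    no warnings 'utf8';    # silence the optional "utf8" category warning
    use Encode qw/decode/;
    use Data::Dump qw/dd/;

    # "\xC3\x96" is "Ö"; chopping off its lead byte leaves a stray "\x96".
    dd decode('UTF-8', "\x96\xC3\x9C");   # character broken by the read
    dd decode('UTF-8', "\xFF\xC3\x9C");   # \xFF is never valid UTF-8
    # Both calls print the same string: U+FFFD followed by U+00DC.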

    Update:

    Not sure what other multi-byte encodings are out there...

    Me neither, but I think UTF-8 and UTF-16 would already cover a lot of what's out there today.

    As you can see in my demo in the other answer

    I don't use the debugger often, so reading its output doesn't come naturally to me ;-)

      > So it doesn't allow you to differentiate between a character that was broken by the read, and an actually malformed input file.

      Providing a callback helps identify the malformed bytes:

      DB<131> dd $rr
      "\x84\xC3\x96\xC3\x9C.\r\n\r\n"
      DB<132> $start=0
      DB<133> $rru = Encode::decode('utf8',$rr, sub{ my $broken = shift; $start++; "" });
      DB<134> dd $rru
      "\xD6\xDC.\r\n\r\n"
      DB<135> p $start
      1
      DB<136>

      > I don't use the debugger often, so reading its output doesn't come naturally to me ;-)

      As commented above.

      Furthermore, the debugger commands used are:

      • p prints scalar
      • x prints list

      DB<79> h p
      p expr        Same as "print {DB::OUT} expr" in current package.
      DB<80> h x
      x expr        Evals expression in list context, dumps the result.

      Update:

      One way to identify how many malformed bytes are at the start, and to be sure the rest is well-formed:

      DB<159> $start=0
      DB<160> $rru = Encode::decode('utf8',$rr,sub{ $start++; return "" });
      DB<161> $sub= substr $rr,$start
      DB<162> $rru2 = Encode::decode('utf8',$sub ,Encode::FB_CROAK);
      DB<163> p $rru2 eq $rru
      1
      DB<164>
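
      The same check as a standalone script instead of a debugger session (my transcription of the session above):

      use strict;
      use warnings;
      use Encode qw/decode/;

      my $rr = "\x84\xC3\x96\xC3\x9C.\r\n\r\n";   # buffer from the session above

      # Count the malformed bytes via the callback, which replaces each
      # of them with the empty string and leaves $rr itself untouched.
      my $start = 0;
      my $rru   = decode('utf8', $rr, sub { $start++; return "" });

      # Skip the malformed prefix and re-decode strictly: FB_CROAK dies
      # if anything in the remainder is still malformed.
      my $sub  = substr $rr, $start;
      my $rru2 = decode('utf8', $sub, Encode::FB_CROAK);

      print $rru2 eq $rru ? 1 : 0, "\n";   # prints 1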

      Cheers Rolf
      (addicted to the Perl Programming Language :)

        The flag FB_QUIET seems to be the answer.
        Providing a callback helps identify the malformed bytes.

        Unfortunately, that doesn't always seem to be the case:

        use warnings;
        use strict;
        use Encode qw/decode/;
        use Data::Dump;
        dd decode('UTF-16-BE', "\xD8\x3D\xDD\xFA", Encode::FB_QUIET|Encode::LEAVE_SRC);
        dd decode('UTF-16-BE', "\x3D\xDD\xFA", Encode::FB_QUIET|Encode::LEAVE_SRC);
        dd decode('UTF-16-BE', "\x3D\xDD\xFA", sub{ sprintf "<U+%04X>", shift });
        dd decode('UTF-16-LE', "\x3D\xD8\xFA\xDD", Encode::FB_QUIET|Encode::LEAVE_SRC);
        dd decode('UTF-16-LE', "\xD8\xFA\xDD", Encode::FB_QUIET|Encode::LEAVE_SRC);
        dd decode('UTF-16-LE', "\xD8\xFA\xDD", sub{ sprintf "<U+%04X>", shift });
        __END__
        "\x{1F5FA}"
        "\x{3DDD}"
        "\x{3DDD}"
        "\x{1F5FA}"
        "\x{FAD8}"
        "\x{FAD8}"

        It could be argued that this is a bug / oversight in Encode, and of course if we know we're reading UTF-16 we should always read an even number of bytes. But still, because of this behavior, and because I don't yet know if there are encodings where chopped-up byte sequences might end up as valid characters, I'm doubtful that a generalized "read any file backwards with Encode" is reliable. Personally I'd just make a version for UTF-8 and UTF-16, and any others as needed; other encodings can be converted to the supported ones.
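
        For the UTF-8-only version, the chunk boundary can even be found without Encode, since continuation bytes are exactly those matching 0b10xxxxxx. A sketch (the helper name and code are mine, not from the thread):

        use strict;
        use warnings;

        # Hypothetical helper: offset of the first character boundary in
        # a chunk of UTF-8 bytes. At most 3 continuation bytes can
        # precede the next start byte in well-formed UTF-8.
        sub utf8_boundary_offset {
            my ($bytes) = @_;
            my $off = 0;
            $off++ while $off < length($bytes)
                     and (ord(substr $bytes, $off, 1) & 0xC0) == 0x80;
            return $off;   # bytes before $off belong to the previous chunk
        }

        print utf8_boundary_offset("\x84\xC3\x96"), "\n";   # prints 1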