Question about Encode module and CHECK parameter

Nocturnus has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

After reading

http://perldoc.perl.org/Encode.html#Handling-Malformed-Data

I still have problems understanding how the CHECK parameter for the encode and decode subroutines works. The following questions are ALL related to encoding to UTF-8 and decoding from UTF-8 (I won't use other encodings in the future).

First, what sense does this parameter make when encoding to UTF-8? Are there characters which could occur in perl strings and which could not be encoded in UTF-8? Probably there are, because otherwise the CHECK parameter for the encode function wouldn't make sense, would it?

Second, if I use FB_DEFAULT for the CHECK parameter in encode, what is SUBCHAR?

Third, I understand the code example given in the explanation of FB_QUIET as far as valid input streams are concerned. But what if the input data not only gets fragmented by reading chunks of fixed size (this would be correctly fixed by the example code), but actually contains invalid bytes? In this case, $buffer would contain the portion starting with the invalid byte; in the next loop iteration, the invalid byte would again not be processed (because it is invalid), thus leaving $buffer as is. This would lead to an infinite loop, wouldn't it?

Fourth, is the following statement true?

"If I make a perl string from an input stream of octets using decode and then make an output stream of octets from that perl string using encode, then encode will never run into invalid characters *regardless* of which constant for CHECK I had used when *decoding*."

(I am aware that the output stream might be different from the input stream, but that is not the question).

Thank you very much,

Nocturnus

Re: Question about Encode module and CHECK parameter by afoken (Chancellor) on Aug 07, 2015 at 17:57 UTC
Is this related to "Unicode surrogate is illegal in UTF-8"? Did you also read perlunitut and perlunifaq, linked from Encode?

Question 1: Yes, perl can use a much larger set of characters than Unicode defines. See "What's the difference between UTF-8 and utf8?" and "UTF-8 vs. utf8 vs. UTF8".

Question 2: This can be answered by reading the docs again:

"If CHECK is 0, encoding and decoding replace any malformed character with a substitution character. When you encode, SUBCHAR is used."

(Emphasis mine)

Alexander

Update 2015-08-10: fixed last link, thanks to soonix

-- Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
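The answer to question 1 can be tried out with a lone surrogate, the case named in the linked thread title. The following is a minimal sketch, not taken from the thread: the variable names are illustrative, and it assumes an Encode version in which the strict "UTF-8" encoding rejects surrogates while the lax "utf8" encoding accepts them.

```perl
use strict;
use warnings;
no warnings 'utf8';               # keep older perls quiet about the surrogate
use Encode qw(encode FB_CROAK);

# A lone surrogate: a legal Perl character, but not legal in strict UTF-8.
my $char = chr(0xD800);

# Lax "utf8" follows Perl's own, more permissive rules.
my $lax = encode('utf8', $char);

# Strict "UTF-8" with CHECK omitted (same as 0) substitutes instead of failing.
my $substituted = encode('UTF-8', $char);

# Strict "UTF-8" with FB_CROAK dies. encode() may modify its source
# argument when CHECK is set, so a copy is passed.
my $copy = $char;
my $ok   = eval { encode('UTF-8', $copy, FB_CROAK); 1 };
my $died = $ok ? 0 : 1;

# Helper (hypothetical name) to show octets as hex.
sub hex_octets { join ' ', map { sprintf '%02X', ord } split //, shift }

print "lax utf8:      ", hex_octets($lax), "\n";
print "strict, CHECK=0: ", hex_octets($substituted), "\n";
print "strict, FB_CROAK: ", ($died ? "died" : "succeeded"), "\n";
```

So for encoding to strict UTF-8, CHECK decides what happens to the few Perl characters that Unicode does not allow in an interchange stream.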
Re^2: Question about Encode module and CHECK parameter by Nocturnus (Scribe) on Aug 08, 2015 at 13:46 UTC
Thanks for bothering.

My questions were not related to the surrogate characters, but more general (at least I think so). Indeed, I had already read all the documents you mentioned, but obviously, that was too long ago.

So, I am glad about your answer for question 1, but I still have problems with question 2. What is that SUBCHAR? I tried to do something like `perl -e 'use warnings; use Encode; print SUBCHAR;'` but that did not work. I got warnings in every combination I could think of, e.g. `Encode::SUBCHAR`, `$SUBCHAR` and so on.

Could somebody please explain what that SUBCHAR actually is by default, how to print its current value and perhaps how to change it?

Thank you very much,

Nocturnus
Re^3: Question about Encode module and CHECK parameter by afoken (Chancellor) on Aug 09, 2015 at 08:20 UTC
The Encode documentation has it all, but it is a little bit hard to understand. It seems to be written by someone who knows the implementation deeply, which makes it hard to explain the interface.

I looked at the source: no SUBCHAR in Encode.pm, but 13 hits in Encode.xs. (Ignore upper and lower case when comparing with the documentation, and don't get confused by the different prefixes for some constants. Just look for names you know from the documentation. And although the file is named Encode.xs, this is just C code with some preprocessor macros.)

    /* encoding */
    //// ...
    if (check & (ENCODE_PERLQQ|ENCODE_HTMLCREF|ENCODE_XMLCREF)){
        SV* subchar =
            (fallback_cb != &PL_sv_undef)
            ? do_fallback_cb(aTHX_ ch, fallback_cb)
            : newSVpvf(check & ENCODE_PERLQQ ? "\\x{%04"UVxf"}" :
                       check & ENCODE_HTMLCREF ? "&#%" UVuf ";" :
                       "&#x%" UVxf ";", (UV)ch);
        SvUTF8_off(subchar); /* make sure no decoded string gets in */
        sdone += slen + clen;
        ddone += dlen + SvCUR(subchar);
        sv_catsv(dst, subchar);
        SvREFCNT_dec(subchar);
    } else {
        /* fallback char */
        sdone += slen + clen;
        ddone += dlen + enc->replen;
        sv_catpvn(dst, (char *)enc->rep, enc->replen);
    }

and

    /* decoding */
    //// ...
    if (check & (ENCODE_PERLQQ|ENCODE_HTMLCREF|ENCODE_XMLCREF)){
        SV* subchar =
            (fallback_cb != &PL_sv_undef)
            ? do_fallback_cb(aTHX_ (UV)s[slen], fallback_cb)
            : newSVpvf("\\x%02" UVXf, (UV)s[slen]);
        sdone += slen + 1;
        ddone += dlen + SvCUR(subchar);
        sv_catsv(dst, subchar);
        SvREFCNT_dec(subchar);
    } else {
        sdone += slen + 1;
        ddone += dlen + strlen(FBCHAR_UTF8);
        sv_catpv(dst, FBCHAR_UTF8);
    }

and

    malformed:
    //// ...
    if (check & (ENCODE_PERLQQ|ENCODE_HTMLCREF|ENCODE_XMLCREF)){
        SV* subchar =
            (fallback_cb != &PL_sv_undef)
            ? do_fallback_cb(aTHX_ uv, fallback_cb)
            : newSVpvf(check & ENCODE_PERLQQ
                       ? (ulen == 1 ? "\\x%02" UVXf : "\\x{%04" UVXf "}")
                       : check & ENCODE_HTMLCREF ? "&#%" UVuf ";"
                       : "&#x%" UVxf ";", uv);
        if (encode){
            SvUTF8_off(subchar); /* make sure no decoded string gets in */
        }
        sv_catsv(dst, subchar);
        SvREFCNT_dec(subchar);
    } else {
        sv_catpv(dst, FBCHAR_UTF8);
    }

I didn't attempt to fully understand what this code does. But it is quite obvious that the various constants for CHECK (`FB_PERLQQ`, `FB_HTMLCREF`, `FB_XMLCREF` in perl, the same with an `ENCODE_` prefix instead of `FB_` in XS) select how a malformed character is replaced. SUBCHAR is an unfortunate name: it is a substitute FOR a character, not A substitute character. In fact, it is a string, existing only as a local variable in XS. You can't access it from Perl.

But there is another hint: `fallback_cb`, a callback function, called whenever a substitute for a malformed character is needed. This is what happens when CHECK is a code reference. Read the section "coderef for CHECK" in the Encode documentation:

"As of `Encode` 2.12, `CHECK` can also be a code reference which takes the ordinal value of the unmapped character as an argument and returns octets that represent the fallback character. ... Even the fallback for `decode` must return octets, which are then decoded with the character encoding that `decode` accepts."

"Octets" are just what everyone else (except for the French) calls bytes. Encode uses the name "byte" for something different: "A character in the range 0..255; a special case of a Perl character." ~~C people would call that a `char`.~~

The callback must always return a string of bytes, as shown in the examples not cited here. So to replace malformed characters with "???", just use `sub { '???' }` as value for CHECK. To replace them with their decimal ordinal value between @ signs, use `sub { sprintf '@%d@',shift }`.

Alexander
Re^4: Question about Encode module and CHECK parameter by Nocturnus (Scribe) on Aug 10, 2015 at 06:47 UTC
Re^5: Question about Encode module and CHECK parameter by afoken (Chancellor) on Aug 10, 2015 at 18:01 UTC