Re^2: Question about Encode module and CHECK parameter

Re^3: Question about Encode module and CHECK parameter by afoken (Chancellor) on Aug 09, 2015 at 08:20 UTC
The Encode documentation has it all, but it is a little bit hard to understand. It seems to be written by someone who deeply knows the implementation, which makes it hard to explain the interface.

I looked at the source: no SUBCHAR in Encode.pm, but 13 hits in Encode.xs. (Ignore upper and lower case when comparing with the documentation, and don't get confused by the different prefixes for some constants. Just look for names you know from the documentation. And although the file is named Encode.xs, this is just C code with some preprocessor macros.)

```c
/* encoding */
//// ...
if (check & (ENCODE_PERLQQ|ENCODE_HTMLCREF|ENCODE_XMLCREF)){
    SV* subchar =
        (fallback_cb != &PL_sv_undef)
        ? do_fallback_cb(aTHX_ ch, fallback_cb)
        : newSVpvf(check & ENCODE_PERLQQ ? "\\x{%04"UVxf"}" :
                   check & ENCODE_HTMLCREF ? "&#%" UVuf ";" :
                   "&#x%" UVxf ";", (UV)ch);
    SvUTF8_off(subchar); /* make sure no decoded string gets in */
    sdone += slen + clen;
    ddone += dlen + SvCUR(subchar);
    sv_catsv(dst, subchar);
    SvREFCNT_dec(subchar);
} else {
    /* fallback char */
    sdone += slen + clen;
    ddone += dlen + enc->replen;
    sv_catpvn(dst, (char *)enc->rep, enc->replen);
}
```

and

```c
/* decoding */
//// ...
if (check & (ENCODE_PERLQQ|ENCODE_HTMLCREF|ENCODE_XMLCREF)){
    SV* subchar =
        (fallback_cb != &PL_sv_undef)
        ? do_fallback_cb(aTHX_ (UV)s[slen], fallback_cb)
        : newSVpvf("\\x%02" UVXf, (UV)s[slen]);
    sdone += slen + 1;
    ddone += dlen + SvCUR(subchar);
    sv_catsv(dst, subchar);
    SvREFCNT_dec(subchar);
} else {
    sdone += slen + 1;
    ddone += dlen + strlen(FBCHAR_UTF8);
    sv_catpv(dst, FBCHAR_UTF8);
}
```

and

```c
malformed:
//// ...
if (check & (ENCODE_PERLQQ|ENCODE_HTMLCREF|ENCODE_XMLCREF)){
    SV* subchar =
        (fallback_cb != &PL_sv_undef)
        ? do_fallback_cb(aTHX_ uv, fallback_cb)
        : newSVpvf(check & ENCODE_PERLQQ
                   ? (ulen == 1 ? "\\x%02" UVXf : "\\x{%04" UVXf "}")
                   : check & ENCODE_HTMLCREF ? "&#%" UVuf ";"
                   : "&#x%" UVxf ";", uv);
    if (encode){
        SvUTF8_off(subchar); /* make sure no decoded string gets in */
    }
    sv_catsv(dst, subchar);
    SvREFCNT_dec(subchar);
} else {
    sv_catpv(dst, FBCHAR_UTF8);
}
```

I didn't attempt to fully understand what this code does. But it is quite obvious that the various constants for CHECK (`FB_PERLQQ`, `FB_HTMLCREF`, `FB_XMLCREF` in Perl, the same with an `ENCODE_` prefix instead of `FB_` in XS) select how a malformed character is replaced.

SUBCHAR is an unfortunate name: it is a substitute FOR a character, not A substitute character. In fact, it is a string, existing only as a local variable in XS. You can't access it from Perl.

But there is another hint: `fallback_cb`, a callback function, called whenever a substitute for a malformed character is needed. This is what happens when CHECK is a code reference. Read "coderef for CHECK" in the Encode documentation:

> coderef for CHECK
>
> As of `Encode` 2.12, `CHECK` can also be a code reference which takes the ordinal value of the unmapped character as an argument and returns octets that represent the fallback character. ...
>
> Even the fallback for `decode` must return octets, which are then decoded with the character encoding that `decode` accepts.

"Octets" are just what everyone else (except for the French) calls bytes. Encode uses the name "byte" for something different: "A character in the range 0..255; a special case of a Perl character." ~~C people would call that a `char`.~~

The callback must always return a string of bytes, as shown in the examples not cited here. So to replace malformed characters with "???", just use `sub { '???' }` as the value for CHECK. To replace them with their decimal ordinal value between @ signs, use `sub { sprintf '@%d@',shift }`.

Alexander

-- Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re^4: Question about Encode module and CHECK parameter by Nocturnus (Scribe) on Aug 10, 2015 at 06:47 UTC
Thanks again for your in-depth answer and the time you have put into the problem.

I am somehow embarrassed that I didn't look into the source code myself. My excuse is that I was convinced that somebody out there already had a list showing which value SUBCHAR is under which circumstances.

Regarding using a coderef for CHECK, I think I had got that in the meantime, but IMHO the question remains if a malformed character has an ordinal value at all (quite sure yes, but which?). Since the respective character is malformed, it probably does not have a Unicode code point, so I am unsure about what value shift would return in that case.

In the meantime, I have decided that in my case it is best to use FB_QUIET for CHECK, so examining more deeply is not what I am planning to do. Maybe later ...

I have asked question 2) mainly because I felt that not explaining SUBCHAR is a substantial lack of documentation and was hoping that I had overlooked something.

Thank you very much again!
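For the `FB_QUIET` choice mentioned in this post, here is a minimal sketch. The buffer contents and variable names are assumptions for illustration. It relies on the behaviour documented for Encode: with `FB_QUIET`, `decode` returns the portion processed before the first error and overwrites the source argument with the unprocessed remainder.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw();

# Assumed sample input: valid ASCII, one invalid byte, more ASCII.
my $buffer = "abc\xFEdef";

# Decoding stops silently at the first malformed byte; $buffer is
# modified in place and keeps everything from that byte onwards.
my $decoded = Encode::decode('utf-8', $buffer, Encode::FB_QUIET);

printf "decoded:   %s\n", $decoded;
printf "remaining: %d byte(s), first is 0x%02X\n",
    length($buffer), ord($buffer);
```

Because the source is modified in place, pass a copy if the original bytes are still needed afterwards.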
Re^5: Question about Encode module and CHECK parameter by afoken (Chancellor) on Aug 10, 2015 at 18:01 UTC
> I am somehow embarrassed that I didn't look into the source code myself. My excuse is that I was convinced that somebody out there already had a list showing which value SUBCHAR is under which circumstances.

In a perfect world, you shouldn't have to look into the source; everything would be explained in the documentation.

> Regarding using a coderef for CHECK, I think I had got that in the meantime, but IMHO the question remains if a malformed character has an ordinal value at all (quite sure yes, but which?).

I would assume that, when converting bytes pretending to be UTF-8 to Perl's internal representation, the CHECK coderef would be called with the value of the (first) offending byte. For the other way (Perl to UTF-8 bytestream), I would expect to get the (first) offending Perl character that can't be expressed as a UTF-8 bytestream. Let's test that:

```perl
#!/usr/bin/perl

use v5.10;
use strict;
use warnings;
use Encode 2.12 qw();

my $octets="a\xFEb";
#            ^-- byte 0xFE is invalid in UTF-8, see https://en.wikipedia.org/wiki/Utf-8
my $string=Encode::decode(
    'utf-8',
    $octets,
    sub {
        my $value=shift;
        return sprintf('<0x%04X>',$value);
    }
);
say $string;

$string="a\x{123456}b";
#          ^-- Unicode is defined from 0 to 0x10FFFF
$octets=Encode::encode(
    'utf-8',
    $string,
    sub {
        my $value=shift;
        return sprintf('<0x%08X>',$value);
    }
);
say $octets;

$string="a\x{00C4}b\x{263A}c"; # a A-Umlaut b Smile c
#                   ^-- not available in ISO-8859-1
$octets=Encode::encode(
    'utf-8',
    $string,
    sub {
        die "Should not happen";
    }
);
# from_to() converts bytes, not characters. To make things easier,
# I use a destination encoding where 1 byte = 1 character.
Encode::from_to(
    $octets, # in-place
    'utf-8',
    'iso-8859-1',
    sub {
        my $value=shift;
        return sprintf('<0x%04X>',$value);
    }
);
binmode STDOUT,':encoding(utf-8)'; # I use a UTF-8 terminal
say $octets; # implicit converting from ISO-8859-1 to UTF-8 due to binmode above
```

Output:

```
a<0x00FE>b
a<0x00123456>b
aÄb<0x263A>c
```

> I have asked question 2) mainly because I felt that not explaining SUBCHAR is a substantial lack of documentation and was hoping that I had overlooked something.

Yes, the documentation could be improved. If you can spend five minutes, file a bug. If you can spend an hour or two, create a patch for the POD and submit it. You now know quite well what's missing in the POD, and how it should be explained. Perhaps post a preview for discussion here.

(If you are working for a boss (and not for fun), explain to him/her that this little bit of time is a kind of "usage fee" for the huge amount of well-written code you use from the Perl community. That was my argument for publishing the initial Unicode patch for DBD::ODBC, and my boss was quite happy with that. We had the patch, we needed it, and by publishing it, the Unicode support became even better. And the best part: I don't have to support it any more. mje has merged it into DBD::ODBC and improved it since then.)

Alexander

-- Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re^6: Question about Encode module and CHECK parameter by Nocturnus (Scribe) on Aug 16, 2015 at 07:35 UTC

coderef for CHECK