comment on

The Encode documentation has it all, but it is a little bit hard to understand. It seems to be written by someone who deeply knows the implementation, which makes it hard to explain the interface.

I looked at the source, no SUBCHAR in Encode.pm, but 13 hits in Encode.xs:

(Ignore uppper and lower case when comparing with the documentation, and don't get confused by the different prefixes for some constants. Just look for names you know from the documentation. And despite the file is named Encode.xs, this ist just C code with some precompiler macros.)

        /* encoding */    
        //// ...
        if (check & (ENCODE_PERLQQ|ENCODE_HTMLCREF|ENCODE_XMLCREF)){
            SV* subchar = 
            (fallback_cb != &PL_sv_undef)
        ? do_fallback_cb(aTHX_ ch, fallback_cb)
        : newSVpvf(check & ENCODE_PERLQQ ? "\\x{%04"UVxf"}" :
                 check & ENCODE_HTMLCREF ? "&#%" UVuf ";" :
                 "&#x%" UVxf ";", (UV)ch);
        SvUTF8_off(subchar); /* make sure no decoded string gets in */
            sdone += slen + clen;
            ddone += dlen + SvCUR(subchar);
            sv_catsv(dst, subchar);
            SvREFCNT_dec(subchar);
        } else {
            /* fallback char */
            sdone += slen + clen;
            ddone += dlen + enc->replen;
            sv_catpvn(dst, (char*)enc->rep, enc->replen);
        }
[download]

and

        /* decoding */
        //// ...
        if (check &
            (ENCODE_PERLQQ|ENCODE_HTMLCREF|ENCODE_XMLCREF)){
            SV* subchar = 
            (fallback_cb != &PL_sv_undef)
        ? do_fallback_cb(aTHX_ (UV)s[slen], fallback_cb) 
        : newSVpvf("\\x%02" UVXf, (UV)s[slen]);
            sdone += slen + 1;
            ddone += dlen + SvCUR(subchar);
            sv_catsv(dst, subchar);
            SvREFCNT_dec(subchar);
        } else {
            sdone += slen + 1;
            ddone += dlen + strlen(FBCHAR_UTF8);
            sv_catpv(dst, FBCHAR_UTF8);
        }
[download]

and

    malformed:
        //// ...
        if (check & (ENCODE_PERLQQ|ENCODE_HTMLCREF|ENCODE_XMLCREF)){
        SV* subchar =
        (fallback_cb != &PL_sv_undef)
        ? do_fallback_cb(aTHX_ uv, fallback_cb)
        : newSVpvf(check & ENCODE_PERLQQ 
               ? (ulen == 1 ? "\\x%02" UVXf : "\\x{%04" UVXf "}")
               :  check & ENCODE_HTMLCREF ? "&#%" UVuf ";" 
               : "&#x%" UVxf ";", uv);
        if (encode){
        SvUTF8_off(subchar); /* make sure no decoded string gets in */
        }
            sv_catsv(dst, subchar);
            SvREFCNT_dec(subchar);
        } else {
            sv_catpv(dst, FBCHAR_UTF8);
        }
[download]

I didn't attempt to fully understand what this code does. But it is quite obvious that the various constants for CHECK (FB_PERLQQ, FB_HTMLCREF, FB_XMLCREF in perl, the same with an ENCODE_ prefix instead of FB_ in XS) select how a malformed character is replaced. SUBCHAR is an unfortunate name, it is a substitute FOR a character, not A substitute character. In fact, it is a string, existing only as a local variable in XS. You can't access it from Perl.

But there is another hint: fallback_cb, a callback function, called whenever a substitute for a malformed character is needed. This is what happens when CHECK is a code reference. Read coderef for CHECK:

coderef for CHECK

As of Encode 2.12, CHECK can also be a code reference which takes the ordinal value of the unmapped character as an argument and returns octets that represent the fallback character.

...

Even the fallback for decode must return octets, which are then decoded with the character encoding that decode accepts.

"Octets" are just what everyone else (except for the french) calls bytes. Encode uses the name "byte" for something different, "A character in the range 0..255; a special case of a Perl character." ~~C people would call that a char.~~ The callback must always return a string of bytes, as shown in the examples not cited here.

So to replace malformed characters with "???", just use sub { '???' } as value for CHECK. To replace them with their decimal ordinal value between @ signs, use sub { sprintf '@%d@',shift }.

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

In reply to Re^3: Question about Encode module and CHECK parameter by afoken
in thread Question about Encode module and CHECK parameter by Nocturnus

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.