in reply to Unexpected interaction between decode_entities() and lc()

I think there's a missing bit in the previous answers: lc is simply following its documentation:

If use bytes is in effect: The results follow ASCII rules. Only the characters A-Z change, to a-z respectively.

(Emphasis mine.) Since neither "use feature 'unicode_strings';" nor e.g. "use 5.016;" was declared, lc does exactly as described above. BTW, I'm impressed by the clever behavior of decode_entities, i.e. that its output depends on the utf8 flag of its argument.
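
To make the effect concrete, here is a minimal sketch (assuming Perl 5.12 or later, which the feature needs; the variable names are only for illustration). It lowercases the same character, É (U+00C9), three ways: as a native string with the utf8 flag off, as the same string after utf8::upgrade, and as the native string again but under unicode_strings.

```perl
use strict;
use warnings;

my $native   = "\xC9";        # É as one character, utf8 flag off
my $upgraded = "\xC9";
utf8::upgrade($upgraded);     # same character, utf8 flag on

# No unicode_strings here: the result depends on the flag.
my $lc_native   = lc $native;
my $lc_upgraded = lc $upgraded;

# With the feature in scope, the flag no longer matters.
my $lc_feature = do { use feature 'unicode_strings'; lc $native };

printf "native:   %02X\n", ord $lc_native;    # C9 (unchanged)
printf "upgraded: %02X\n", ord $lc_upgraded;  # E9 (é)
printf "feature:  %02X\n", ord $lc_feature;   # E9 (é)
```

The three inputs compare equal with eq, yet only the first one comes back from lc unchanged.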

Re^2: Unexpected interaction between decode_entities() and lc()
by haukex (Archbishop) on Nov 15, 2017 at 09:31 UTC

    use feature 'unicode_strings'; is an excellent point!

    Since neither "use feature 'unicode_strings';", nor e.g. "use 5.016;" was declared, then lc does exactly as described above.

    If by "described above" you mean "Only the characters A-Z change, to a-z respectively.", then I think your reading of the lc docs might be a little off: my understanding is that bytes is not the default behavior. The following test took a little fiddling to get the right values, but it passes on Perl 5.8.1, 5.8.9, 5.10.1, and releases up to 5.26, and it shows the differences:

    use warnings;
    use strict;
    use utf8;
    use Test::More;
    
    diag explain "Perl $]";
    plan tests => $] ge '5.012' ? 15 : 11;
    SKIP: {
    	is "\N{U+00C9}", "É",           '\N{U+...} escape';
    	skip 'Perl ge 5.12 required', 1 unless $] ge '5.012';
    	ok utf8::is_utf8("\N{U+00C9}"), '\N{U+...} sets UTF8';
    }
    {
    	ok !utf8::is_utf8("\x{C9}"),    '\x doesn\'t set UTF8';
    	is lc("\x{C9}"), "\xC9",        'lc on non-UTF8 str';
    	ok utf8::is_utf8("É"),          'str is UTF8';
    	is lc("É"), "é",                'lc on UTF8 str';
    }
    {
    	use bytes;
    	ok !utf8::is_utf8("\x{C9}"),    'bytes: \x doesn\'t set UTF8';
    	is lc("\x{C9}"), "\xC9",        'bytes: lc on non-UTF8 str';
    	ok utf8::is_utf8("É"),          'bytes: str is UTF8';
    	is lc("É"), $] lt '5.008009' ? "\xC9" : "\xC3\x89",
    	                                'bytes: lc on UTF8 str';
    }
    SKIP: { skip 'Perl ge 5.12 required', 1 unless $] ge '5.012';
    ok eval q{ do {
    	use feature 'unicode_strings';
    	ok !utf8::is_utf8("\x{C9}"),    'u_s: \x doesn\'t set UTF8';
    	is lc("\x{C9}"), "é",           'u_s: lc on non-UTF8 str';
    	ok utf8::is_utf8("É"),          'u_s: str is UTF8';
    	is lc("É"), "é",                'u_s: lc on UTF8 str';
    1 } }, 'unicode_strings works' or warn $@ }
    

      You are right, I was too quick to paste a quote from the lc documentation page; it should have been the last, fall-through case.

      Also, kurisuto, my comment was not a solution, but rather an attempt to explain (to myself) what was happening: I too rarely deal with strings that are extended ASCII and yet not utf8. The proper fix (at least, for anything but one-time scripts) would be to always explicitly decode inputs from all sources, rather than hoping they are Latin-1 only and that Perl silently does "the right thing" in the background.
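
      A rough sketch of what "explicitly decode" could look like with the core Encode module. The input byte and the choice of ISO-8859-1 are assumptions for illustration only; a real script has to know (or be told) the actual encoding of each source.

      ```perl
      use strict;
      use warnings;
      use Encode qw(decode encode);

      # Hypothetical input: the single byte a Latin-1 source would send for "É".
      my $raw_bytes = "\xC9";

      # Decode at the input boundary. FB_CROAK dies on malformed data;
      # LEAVE_SRC keeps decode from modifying $raw_bytes in place.
      my $text = decode('ISO-8859-1', $raw_bytes,
                        Encode::FB_CROAK() | Encode::LEAVE_SRC());

      # $text is now a decoded character string, so lc lowercases É to é.
      my $lowered = lc $text;

      # Encode again only when the data leaves the program.
      my $out_bytes = encode('UTF-8', $lowered);
      ```

      Done this way, the result of lc no longer depends on whether some earlier step happened to set the utf8 flag.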

Re^2: Unexpected interaction between decode_entities() and lc()
by kurisuto (Novice) on Nov 14, 2017 at 20:07 UTC
    Aha! Adding "use feature 'unicode_strings';" fixed the problem. Thank you!