in reply to Re: Unexpected interaction between decode_entities() and lc()
in thread Unexpected interaction between decode_entities() and lc()

use feature 'unicode_strings'; is an excellent point!

Since neither "use feature 'unicode_strings';", nor e.g. "use 5.016;" was declared, then lc does exactly as described above.

If by "described above" you mean "Only the characters A-Z change, to a-z respectively.", then I think your reading of the lc docs might be a little off: as I understand it, bytes semantics are not the default behavior. The following test took a little fiddling to get the right values, but it passes on Perl 5.8.1, 5.8.9, 5.10.1, and later releases up to 5.26, and it shows the differences:

use warnings;
use strict;
use utf8;
use Test::More;

diag explain "Perl $]";
plan tests => $] ge '5.012' ? 15 : 11;
SKIP: {
	is "\N{U+00C9}", "É",           '\N{U+...} escape';
	skip 'Perl ge 5.12 required', 1 unless $] ge '5.012';
	ok utf8::is_utf8("\N{U+00C9}"), '\N{U+...} sets UTF8';
}
{
	ok !utf8::is_utf8("\x{C9}"),    '\x doesn\'t set UTF8';
	is lc("\x{C9}"), "\xC9",        'lc on non-UTF8 str';
	ok utf8::is_utf8("É"),          'str is UTF8';
	is lc("É"), "é",                'lc on UTF8 str';
}
{
	use bytes;
	ok !utf8::is_utf8("\x{C9}"),    'bytes: \x doesn\'t set UTF8';
	is lc("\x{C9}"), "\xC9",        'bytes: lc on non-UTF8 str';
	ok utf8::is_utf8("É"),          'bytes: str is UTF8';
	is lc("É"), $] lt '5.008009' ? "\xC9" : "\xC3\x89",
	                                'bytes: lc on UTF8 str';
}
SKIP: { skip 'Perl ge 5.12 required', 1 unless $] ge '5.012';
ok eval q{ do {
	use feature 'unicode_strings';
	ok !utf8::is_utf8("\x{C9}"),    'u_s: \x doesn\'t set UTF8';
	is lc("\x{C9}"), "é",           'u_s: lc on non-UTF8 str';
	ok utf8::is_utf8("É"),          'u_s: str is UTF8';
	is lc("É"), "é",                'u_s: lc on UTF8 str';
1 } }, 'unicode_strings works' or warn $@ }

Re^3: Unexpected interaction between decode_entities() and lc()
by vr (Curate) on Nov 15, 2017 at 11:00 UTC

    You are right. I was too quick to paste a quote from the lc documentation page; it should have been the last, fall-through case.

    Also, kurisuto, my comment was not a solution but rather an attempt to explain (to myself) what was happening; I too rarely deal with strings that are extended ASCII yet not UTF-8. The proper fix (at least for anything but one-time scripts) would be to always explicitly decode inputs from all sources, rather than hoping they are Latin-1 only and that Perl silently does "the right thing" in the background.
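
To make the "always decode explicitly" advice concrete, here is a minimal sketch using the core Encode module. The byte strings and variable names are made up for illustration, and the encodings are assumed to be known in advance; in a real program the encoding has to come from the data source (HTTP headers, file format, and so on).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical raw input: the same character, U+00C9 (E with acute),
# as it might arrive in two different encodings.
my $latin1_bytes = "\xC9";        # ISO-8859-1 octets
my $utf8_bytes   = "\xC3\x89";    # UTF-8 octets

# Not decoded, no 'unicode_strings', no locale: lc() only touches A-Z,
# so the byte \xC9 comes back unchanged.
my $undecoded = lc $latin1_bytes;

# Decoded explicitly: both inputs become the same one-character string,
# and lc() then applies Unicode rules (U+00C9 -> U+00E9).
my $lc_latin1 = lc decode('ISO-8859-1', $latin1_bytes);
my $lc_utf8   = lc decode('UTF-8',      $utf8_bytes);

# Helper (illustrative only): show a string as a list of code points.
sub codepoints { join ' ', map { sprintf 'U+%04X', ord } split //, shift }

print "undecoded:       ", codepoints($undecoded), "\n";
print "latin1, decoded: ", codepoints($lc_latin1), "\n";
print "utf8, decoded:   ", codepoints($lc_utf8),   "\n";
```

The point of the sketch is only that once every input has gone through decode(), lc() behaves the same way regardless of where the data came from, so the result no longer depends on whether a string happens to carry the UTF8 flag.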