Re: Unexpected interaction between decode

I think there's a missing bit in the previous answers, namely that lc simply follows its documentation:

If use bytes is in effect: The results follow ASCII rules. Only the characters A-Z change, to a-z respectively.

(Emphasis mine.) Since neither "use feature 'unicode_strings';", nor e.g. "use 5.016;" was declared, then lc does exactly as described above. BTW, I'm impressed by the clever behavior of decode_entities, i.e. its output depends on the utf8 flag of its argument.
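To make the flag dependence concrete, here is a minimal sketch using only core Perl. It assumes no `use feature 'unicode_strings'`, no `use locale`, and no `use v5.12` or later in scope; `lc_native` is a made-up helper name, not something from the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: lc with neither unicode_strings nor locale in scope.
sub lc_native {
    my ($s) = @_;
    return lc $s;
}

my $downgraded = "\xC9";     # E-acute, stored without the UTF8 flag
my $upgraded   = "\xC9";
utf8::upgrade($upgraded);    # same character, now stored with the UTF8 flag on

# Same logical string, different internal representation, different lc result.
printf "flag off: %02X\n", ord(lc_native($downgraded));   # C9 (unchanged)
printf "flag on:  %02X\n", ord(lc_native($upgraded));     # E9 (lowercased)
```

The two strings compare equal with `eq`, which is why the difference is so easy to miss.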

Re^2: Unexpected interaction between decode_entities() and lc() by haukex (Archbishop) on Nov 15, 2017 at 09:31 UTC
`use feature 'unicode_strings';` is an excellent point!

> Since neither `"use feature 'unicode_strings';"`, nor e.g. `"use 5.016;"` was declared, then lc does exactly as described above.

If by "described above" you mean "Only the characters A-Z change, to a-z respectively.", then I think your reading of the lc docs might be a little off: my understanding is that the `use bytes` behavior is not the default. The following test took a little fiddling to get the right values, but it passes on Perl releases from 5.8.1, 5.8.9, and 5.10.1 up to 5.26, and it shows the differences:

    use warnings;
    use strict;
    use utf8;
    use Test::More;

    diag explain "Perl $]";
    plan tests => $] ge '5.012' ? 15 : 11;

    SKIP: {
        is "\N{U+00C9}", "É", '\N{U+...} escape';
        skip 'Perl ge 5.12 required', 1 unless $] ge '5.012';
        ok utf8::is_utf8("\N{U+00C9}"), '\N{U+...} sets UTF8';
    }
    {
        ok !utf8::is_utf8("\x{C9}"), '\x doesn\'t set UTF8';
        is lc("\x{C9}"), "\xC9", 'lc on non-UTF8 str';
        ok utf8::is_utf8("É"), 'str is UTF8';
        is lc("É"), "é", 'lc on UTF8 str';
    }
    {
        use bytes;
        ok !utf8::is_utf8("\x{C9}"), 'bytes: \x doesn\'t set UTF8';
        is lc("\x{C9}"), "\xC9", 'bytes: lc on non-UTF8 str';
        ok utf8::is_utf8("É"), 'bytes: str is UTF8';
        is lc("É"), $] lt '5.008009' ? "\xC9" : "\xC3\x89", 'bytes: lc on UTF8 str';
    }
    SKIP: {
        skip 'Perl ge 5.12 required', 1 unless $] ge '5.012';
        ok eval q{ do {
            use feature 'unicode_strings';
            ok !utf8::is_utf8("\x{C9}"), 'u_s: \x doesn\'t set UTF8';
            is lc("\x{C9}"), "é", 'u_s: lc on non-UTF8 str';
            ok utf8::is_utf8("É"), 'u_s: str is UTF8';
            is lc("É"), "é", 'u_s: lc on UTF8 str';
            1
        } }, 'unicode_strings works' or warn $@
    }
Re^3: Unexpected interaction between decode_entities() and lc() by vr (Curate) on Nov 15, 2017 at 11:00 UTC
You are right, I was too quick to paste a quote from the lc documentation page; it should have been the last, fall-through case. Also, kurisuto, my comment was not a solution, but rather an attempt to explain (to myself) what was happening: I deal too rarely with strings that are extended-ASCII and yet not UTF-8. The proper fix (at least for anything but one-time scripts) would be to always explicitly decode inputs from all sources, rather than hoping that they are Latin-1 only and that Perl silently does "the right thing" in the background.
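A rough sketch of what "explicitly decode inputs" could look like with the core Encode module. The helper name `decode_input` and the two sample byte strings are assumptions for illustration only; real code would take the encoding from wherever the data source declares it. As far as I understand Encode, a decoded string with non-ASCII characters comes back with the UTF8 flag on, so lc then applies Unicode rules even without `unicode_strings`.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn raw input octets into a character string.
sub decode_input {
    my ($octets, $encoding) = @_;
    # FB_CROAK dies on malformed input instead of silently substituting;
    # LEAVE_SRC keeps decode() from modifying $octets in place.
    return decode($encoding, $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
}

my $from_utf8   = decode_input("\xC3\x89", 'UTF-8');       # E-acute as UTF-8 octets
my $from_latin1 = decode_input("\xC9",     'ISO-8859-1');  # E-acute as one Latin-1 octet

# Both inputs decode to U+00C9, and lc yields U+00E9 for each.
printf "UTF-8 input:   U+%04X\n", ord(lc($from_utf8));
printf "Latin-1 input: U+%04X\n", ord(lc($from_latin1));
```

The point of decoding at the boundary is that the rest of the program no longer depends on how the string happened to arrive.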
Re^2: Unexpected interaction between decode_entities() and lc() by kurisuto (Novice) on Nov 14, 2017 at 20:07 UTC
Aha! Adding "use feature 'unicode_strings';" fixed the problem. Thank you!
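For anyone finding this later, a minimal sketch of that fix in isolation. It needs Perl 5.12 or newer; `lc_unicode` is a hypothetical name, and the feature is enabled only inside that sub because `use feature` is lexically scoped.

```perl
use strict;
use warnings;

# Hypothetical helper: lc with Unicode rules regardless of the UTF8 flag.
sub lc_unicode {
    use feature 'unicode_strings';   # applies only within this sub body
    my ($s) = @_;
    return lc $s;
}

my $str = "\xC9";   # E-acute, stored without the UTF8 flag

printf "with unicode_strings: %02X\n", ord(lc_unicode($str));   # E9
```

In a real script the pragma would normally sit at the top of the file (or come implicitly from `use 5.012;` or later) rather than inside a single sub.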