Mixed Unicode and ANSI string comparisons?

BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

What happens if you pass a list containing a mix of Unicode and non-Unicode scalars to sort?

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)

In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re: Mixed Unicode and ANSI string comparisons? by choroba (Cardinal) on Dec 14, 2015 at 22:18 UTC
Hi BrowserUk, welcome to the Monastery! What have you tried?

```perl
#! /usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

use open OUT => ':utf8', ':std';
use Encode;

my @strings = ("\N{LATIN SMALL LETTER C WITH CARON}",
               "c",
               "\N{LATIN SMALL LETTER C WITH CEDILLA}",
               "\N{LATIN SMALL LETTER C WITH ACUTE}");
my $i = 4;
# Append the UTF-8 encoded (byte string) version of each string.
push @strings, map encode('utf-8', $_), @strings;
# For each sorted string, print every character with its ordinal.
say join ',', map "$_: " . ord, split // for sort @strings;
```

($q=q:Sq=~/;[c](.)(.)/;chr(-\|\|-\|5+lengthSq)`"S\|oS2"`map{chr \|+ord }map{substrSq`S_+\|`\|}3E\|-\|`7**2-3:)=~y+S\|`+$1,++print+eval$q,q,a,
Re^2: Mixed Unicode and ANSI string comparisons? by BrowserUk (Patriarch) on Dec 14, 2015 at 23:32 UTC
Now I'm even more depressed. Inspecting the code I expected the output to consist of 4 Unicode and 4 non-Unicode scalars, (or possibly 8 Unicode if they were automatically converted for the comparison), but I get 5 non-Unicode and 3 Unicode?? What gives?

```perl
#! /usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
use utf8;
use open OUT => ':utf8', ':std';
use Encode;

my @strings = ("\N{LATIN SMALL LETTER C WITH CARON}",
               "c",
               "\N{LATIN SMALL LETTER C WITH CEDILLA}",
               "\N{LATIN SMALL LETTER C WITH ACUTE}");
push @strings, map encode('utf-8', $_), @strings;
# Print each sorted string alongside its internal utf8 flag.
printf "%10s %u\n", $_, utf8::is_utf8( $_ ) for sort @strings;
__END__
C:\test>\perl22\bin\perl.exe junk33.pl
c 0
c 0
Ã§ 0
Ä? 0
Ä 0
ç 1
c 1
c 1
```
Re^3: Mixed Unicode and ANSI string comparisons? by choroba (Cardinal) on Dec 14, 2015 at 23:47 UTC
Plain "c" in ASCII is indistinguishable from "c" in UTF-8. In fact, every 7-bit ASCII character is encoded in UTF-8 as the same single byte.
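A minimal sketch of that point, assuming the core `Encode` module is available. The variable names are made up for illustration; the idea is only that encoding an ASCII character to UTF-8 changes nothing, while a non-ASCII character grows to more than one byte.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical names, for illustration only.
my $char  = "c";
my $bytes = encode('UTF-8', $char);   # UTF-8 encoded copy of "c"

# An ASCII character and its UTF-8 encoding are the same one-byte string.
print $char eq $bytes ? "same\n" : "different\n";   # prints "same"

# A non-ASCII character is different: one character becomes two bytes.
my $accented       = "\x{E7}";                      # c with cedilla
my $accented_bytes = encode('UTF-8', $accented);
print length($accented), " vs ", length($accented_bytes), "\n";   # prints "1 vs 2"
```

This is why the plain "c" shows up twice in the sorted output with no visible difference: both copies hold the single code point 0x63.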
Re^4: Mixed Unicode and ANSI string comparisons? by BrowserUk (Patriarch) on Dec 15, 2015 at 01:17 UTC
Re^5: Mixed Unicode and ANSI string comparisons? by choroba (Cardinal) on Dec 15, 2015 at 08:44 UTC
Re^3: Mixed Unicode and ANSI string comparisons? by Anonymous Monk on Dec 15, 2015 at 00:20 UTC
"(or possibly 8 Unicode if they were automatically converted for the comparison), but I get 5 non-Unicode and 3 Unicode?? What gives?"

No. As far as perl is concerned, you start with 4 Unicode strings and get 8 Unicode strings... in different storage formats. The `utf8` flag says pretty much nothing about "Unicodeness".

Is it a problem that `encode('utf-8', $_)` returns something indistinguishable from a "Unicode string" (as people usually understand it)? Yes, it's a problem in practice.

Think about it this way: "1" in perl is struct PV, 1 is struct IV, "1" + 1 is PVIV (if I remember correctly). Now, what would happen if, say, the string concatenation operator was '+' (plus)? How would you determine what `$x + $y` actually does? What if `cmp` did the same thing as `<=>`, and `ge` was just like `>=`? How would you sort numbers? That's the situation with "Unicode" and "binary" strings in Perl, pretty much.

As Ricardo Signes said: "Right now, you can write programs in Perl that handle all this correctly, using only one tool: extreme vigilance. Or, more likely, two tools: vigilance and a debugger."

I personally use `Devel::Peek` instead of a debugger :)

Oh, and here's an example of a non-Unicode string: `"\x{FFFF_FFFF_FFFF}"` (Unicode doesn't have such a big "codepoint")
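To make the "different storage formats" point concrete, here is a small sketch using only the built-in `utf8::upgrade` and `utf8::is_utf8` functions. The variable names are hypothetical. It shows one and the same one-character string stored both ways: the flag differs, but perl treats the two as equal.

```perl
use strict;
use warnings;

# Hypothetical names, for illustration only.
my $native   = "\xE7";     # c with cedilla, stored as a single byte, flag off
my $upgraded = $native;
utf8::upgrade($upgraded);  # same character, now stored internally as UTF-8, flag on

print utf8::is_utf8($native)   ? 1 : 0, "\n";   # prints 0
print utf8::is_utf8($upgraded) ? 1 : 0, "\n";   # prints 1

# The flag describes internal storage only; the string value is unchanged.
print $native eq $upgraded ? "equal\n" : "different\n";        # prints "equal"
print length($native), " ", length($upgraded), "\n";           # prints "1 1"
```

So `utf8::is_utf8` returning 0 does not by itself mean "this is a byte string", which is why the program has to keep track of what each string is supposed to contain.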
Re^4: Mixed Unicode and ANSI string comparisons? by BrowserUk (Patriarch) on Dec 15, 2015 at 01:14 UTC
Re^5: Mixed Unicode and ANSI string comparisons? by Your Mother (Archbishop) on Dec 15, 2015 at 18:49 UTC
Re^5: Mixed Unicode and ANSI string comparisons? by Anonymous Monk on Dec 15, 2015 at 01:44 UTC
Re^4: Mixed Unicode and ANSI string comparisons? by Anonymous Monk on Dec 15, 2015 at 01:15 UTC
Re^5: Mixed Unicode and ANSI string comparisons? by choroba (Cardinal) on Dec 15, 2015 at 09:08 UTC
Re^2: Mixed Unicode and ANSI string comparisons? by Anonymous Monk on Dec 15, 2015 at 00:51 UTC
Why do you have both `use open OUT => ':utf8', ':std';` and `map encode('utf-8', $_), @strings;`?
Re^3: Mixed Unicode and ANSI string comparisons? by Anonymous Monk on Dec 15, 2015 at 01:01 UTC
Because without `use open`, perl would try to downgrade `"$_: "` when printing it, and warn that it can't do so for some strings ("Wide character ...").
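A rough sketch of that behaviour. As an assumption made for the sake of a self-contained example, it prints to in-memory file handles rather than STDOUT, and it uses an `:encoding(UTF-8)` layer in place of the `use open` pragma; all variable names are hypothetical.

```perl
use strict;
use warnings;

# Collect warnings instead of printing them (hypothetical helper array).
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

my $text = "\x{10D}: 269";   # c with caron, a code point above 0xFF

# No encoding layer: perl cannot downgrade the string to bytes and warns.
open my $raw, '>', \my $raw_bytes or die "open failed: $!";
print {$raw} $text;
close $raw;

# With an encoding layer the same print is silent.
open my $enc, '>:encoding(UTF-8)', \my $enc_bytes or die "open failed: $!";
print {$enc} $text;
close $enc;

print scalar(@warnings), " warning(s): $warnings[0]";
print length($enc_bytes), " bytes written through the layer\n";
```

Only the first `print` should produce a "Wide character in print" warning; the second writes the 6-character string as 7 UTF-8 bytes without complaint.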
Re^4: Mixed Unicode and ANSI string comparisons? ( binmode utf8 and :encoding(utf8)) by Anonymous Monk on Dec 15, 2015 at 01:22 UTC
Re^5: Mixed Unicode and ANSI string comparisons? ( binmode utf8 and :encoding(utf8)) by Anonymous Monk on Dec 15, 2015 at 02:55 UTC
Re^2: Mixed Unicode and ANSI string comparisons? by BrowserUk (Patriarch) on Dec 14, 2015 at 22:48 UTC
"What have you tried?"

Nothing! I know just enough about Unicrap to know that I want nothing to do with it. But something came up. Hence, I'm asking for expert help.

```perl
#! /usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

use open OUT => ':utf8', ':std';
use Encode;

my @strings = ("\N{LATIN SMALL LETTER C WITH CARON}",
               "c",
               "\N{LATIN SMALL LETTER C WITH CEDILLA}",
               "\N{LATIN SMALL LETTER C WITH ACUTE}");
# Append the UTF-8 encoded (byte string) version of each string.
push @strings, map encode('utf-8', $_), @strings;
say for sort @strings;
```

Hm. That code tells me nothing useful and neither does the output:

```
C:\test>\perl22\bin\perl.exe junk33.pl
c
c
├â┬º
├ä┬ç
├ä┬ì
├º
─ç
─ì
```

Except maybe that sort readily accepts mixed scalars, which doesn't make any sense at all to me. How can it compare and collate strings that exist in two entirely different encoding spaces? And interleaving them is like interleaving Chinese and Cyrillic strings; it makes no sense at all.
Re^3: Mixed Unicode and ANSI string comparisons? by Anonymous Monk on Dec 14, 2015 at 22:57 UTC
"How can it compare and collate strings that exist in two entirely different encoding spaces?"

Yeah... just like that :) To be fair, Unicode::Collate is what does real collating, and it's very good.
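For comparison, a short sketch of what Unicode::Collate does differently from plain `sort`, assuming the module is installed (it ships with perl) and using its default collation. The word list and variable names are made up for illustration.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Hypothetical sample data: "d" plus four variants of "c".
my @words = ("d",
             "\N{U+010D}",   # c with caron
             "c",
             "\N{U+00E7}",   # c with cedilla
             "\N{U+0107}");  # c with acute

# Plain sort compares code points, so "d" (0x64) lands before the accented letters.
my @by_code_point = sort @words;

# Unicode::Collate applies the Unicode Collation Algorithm instead,
# which keeps all the "c" variants together, ahead of "d".
my @collated = Unicode::Collate->new->sort(@words);

binmode STDOUT, ':encoding(UTF-8)';
print "code point order: @by_code_point\n";
print "collated order:   @collated\n";
```

With plain `sort` the letter "d" ends up second, right after the unaccented "c"; with the collator it ends up last.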
Re^3: Mixed Unicode and ANSI string comparisons? by exilepanda (Friar) on Dec 17, 2015 at 08:28 UTC
`#! /usr/bin/perl` is useless on Windows (and the path is wrong here). Your code doesn't need to be that complex. The following code works as well:

```perl
use warnings;
use strict;
use utf8;
use feature "say";
binmode STDOUT, ":utf8";
my $unicodeStr = "..."; # assign your own Unicode char as I can't type Unicode in here
say for sort ($unicodeStr, "user login");
```

One problem is that when you run the script inside a CMD console, its default code page is a legacy (OEM) code page rather than UTF-8, and whatever code page you set (chcp), you will not be able to print a proper result (in my experience). To print a proper result, you might want to use tools like PowerShell or NppExec (a plugin) with Notepad++ (you also have to set the output encoding for this plugin).
Re: Mixed Unicode and ANSI string comparisons? by Anonymous Monk on Dec 14, 2015 at 22:28 UTC
The strings get sorted by code points. Other than that, nothing (AFAIK). What's an 'ANSI' string?
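A sketch of what "sorted by code points" means for a mixed list like the one in this thread, assuming the core `Encode` module. The helper `code_points` and the variable names are hypothetical. Each encoded string is just a sequence of characters in the range 0x00 to 0xFF, so it is compared with the others like any other string.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a string as its code points in hex.
sub code_points { join ' ', map { sprintf '%X', ord($_) } split //, shift }

# The four characters from the thread: c-caron, c, c-cedilla, c-acute.
my @chars = ("\x{10D}", "c", "\x{E7}", "\x{107}");

# Character strings plus their UTF-8 encoded byte strings, as in the thread.
my @mixed = (@chars, map { encode('UTF-8', $_) } @chars);

my @order = map { code_points($_) } sort @mixed;
print "$_\n" for @order;
# prints, one per line: 63, 63, C3 A7, C4 87, C4 8D, E7, 107, 10D
```

The encoded strings start with 0xC3 or 0xC4, which is below 0xE7, 0x107 and 0x10D, so they sort ahead of the three accented character strings. Together with the two identical copies of "c", that gives the "5 and 3" split seen above.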
Re^2: Mixed Unicode and ANSI string comparisons? by BrowserUk (Patriarch) on Dec 14, 2015 at 22:50 UTC
LMGTFY https://en.wikipedia.org/wiki/Windows_ANSI_code_page :)
Re^3: Mixed Unicode and ANSI string comparisons? by Anonymous Monk on Dec 14, 2015 at 22:54 UTC
Oh, I remember. I heard that expression from Delphi programmers, and yeah, I suspected that was some Windows-specific lingo :)
Re: Mixed Unicode and ANSI string comparisons? by 1nickt (Canon) on Dec 16, 2015 at 10:15 UTC
Boy, you really are a piece of work! You post a sniveling whine disguised as a question (with no code, description of what you've tried, or expected output), crow about how you've deliberately kept your head in the sand about a key element of modern programming, act like Perl and the world should be arranged to your whim, and then call people rude names when they try to help you. Quite a panoply of antisocial, immature behaviours, for which others less experienced than you would be rightly chastised. Try this module, princess. The way forward always starts with a minimal test.
Re^2: Mixed Unicode and ANSI string comparisons? by Anonymous Monk on Dec 16, 2015 at 11:08 UTC
1nickt: your job isn't to go around defending every user you have insulted -- flip flopping is a flop
Re^2: Mixed Unicode and ANSI string comparisons? by BrowserUk (Patriarch) on Dec 16, 2015 at 12:36 UTC
"call people rude names when they try to help you."

Pedantry is never "helpful"; bandwagon jumping is less so. I commiserate entirely with those that have to deal with this shit; I'm in a position where I don't have to. Jealous much?
