
Re^2: Mixed Unicode and ANSI string comparisons?

by BrowserUk (Patriarch)
on Dec 14, 2015 at 23:32 UTC


in reply to Re: Mixed Unicode and ANSI string comparisons?
in thread Mixed Unicode and ANSI string comparisons?

Now I'm even more depressed.

Inspecting the code I expected the output to consist of 4 Unicode and 4 non-Unicode scalars, (or possibly 8 Unicode if they were automatically converted for the comparison), but I get 5 non-Unicode and 3 Unicode?? What gives?

#! /usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
use utf8;
use open OUT => ':utf8', ':std';
use Encode;

my @strings = ("\N{LATIN SMALL LETTER C WITH CARON}",
               "c",
               "\N{LATIN SMALL LETTER C WITH CEDILLA}",
               "\N{LATIN SMALL LETTER C WITH ACUTE}");
push @strings, map encode('utf-8', $_), @strings;

printf "%10s %u\n", $_, utf8::is_utf8( $_ ) for sort @strings;
__END__
C:\test>\perl22\bin\perl.exe junk33.pl
         c 0
         c 0
         ç 0
         ? 0
         č 0
           1
         c 1
         c 1

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^3: Mixed Unicode and ANSI string comparisons?
by choroba (Cardinal) on Dec 14, 2015 at 23:47 UTC
    Plain "c" in ASCII is indistinguishable from the "c" in UTF-8. In fact, all the 7-bit ASCII are part of the UTF-8.
    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
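A minimal sketch of the point above, assuming only the core Encode module: encoding a pure-ASCII string to UTF-8 yields the same bytes, and the result carries no utf8 flag, so the two "c" entries in the original test cannot be told apart. The variable names are illustrative and not from the thread.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encoding a pure-ASCII string to UTF-8 yields the very same bytes.
my $plain   = "c";
my $encoded = encode('UTF-8', $plain);

printf "equal: %s\n", $plain eq $encoded ? "yes" : "no";
printf "flags: %d %d\n",
    utf8::is_utf8($plain)   ? 1 : 0,
    utf8::is_utf8($encoded) ? 1 : 0;

# A non-ASCII character, by contrast, becomes a multi-byte sequence.
# U+010D is LATIN SMALL LETTER C WITH CARON.
my $caron_bytes = encode('UTF-8', "\x{10D}");
printf "bytes: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $caron_bytes;
```

Both flags come out as 0 here, which is why the sorted list in the original post shows two indistinguishable "c 0" rows.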
      Plain "c" in ASCII is indistinguishable from the "c" in UTF-8.

      I thought that the utf flag would distinguish strings that you've asked to be utf8 encoded; from those you haven't. Even if they both contain the same 7-bit codes.

      If that's not the case; perl's Unicode support is even more broken than I thought.

      In fact, all the 7-bit ASCII are part of the UTF-8.

      And if the non-Unicode strings contain 8-bit chars?


        And if the non-Unicode strings contain 8-bit chars?
        That's what I tried to demonstrate with the encode, but apparently failed. The utf flag is an internal thing, you shouldn't care about it. If you're getting strings of mixed encodings from the data, fix the data or the input routines to unify the encoding.
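To illustrate why the flag is "an internal thing", here is a small sketch using the built-in utf8::upgrade. It assumes nothing beyond core Perl; the variable names are made up for the example. The same one-character string is held in both storage formats, and the comparison operators treat the two copies as identical.

```perl
use strict;
use warnings;

my $native   = "\xE7";      # c with cedilla, stored as a single byte
my $upgraded = "\xE7";
utf8::upgrade($upgraded);   # same character, now stored internally as UTF-8

# The flag differs between the two copies...
my $flag_differs = (utf8::is_utf8($native)   ? 1 : 0)
                != (utf8::is_utf8($upgraded) ? 1 : 0);

# ...but string semantics do not.
my $same_string = $native eq $upgraded;
my $same_length = length($native) == length($upgraded);
my $cmp_result  = $native cmp $upgraded;

printf "flag differs: %d, eq: %d, same length: %d, cmp: %d\n",
    $flag_differs ? 1 : 0, $same_string ? 1 : 0,
    $same_length ? 1 : 0, $cmp_result;
```

So sort cannot separate strings by flag. The flag only tells perl how to read the internal buffer, not which encoding the data originally arrived in.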
Re^3: Mixed Unicode and ANSI string comparisons?
by Anonymous Monk on Dec 15, 2015 at 00:20 UTC
    (or possibly 8 Unicode if they were automatically converted for the comparison), but I get 5 non-Unicode and 3 Unicode?? What gives?
    No, as far as perl is concerned, you start with 4 Unicode strings and get 8 Unicode strings... in different storage formats. utf8 flag says pretty much nothing about "Unicodeness".

    Is that a problem that encode('utf-8', $_) returns what is indistinguishable from "Unicode string" (as people usually understand it)? Yes, it's a problem in practice.

    Think about it this way: "1" in perl is struct PV, 1 is struct IV, "1" + 1 is PVIV (if I remember correctly). Now, what would happen if, say, the string concatenation operator was '+' (plus)? How would you determine what $x + $y actually does? What if cmp did the same thing as <=>, and ge was just like >=? How would you sort numbers?

    That's the situation with "Unicode" and "binary" strings in Perl, pretty much. As Ricardo Signes said:
    Right now, you can write programs in Perl that handle all this correctly, using only one tool: extreme vigilance. Or, more likely, two tools: vigilance and a debugger.
    I personally use Devel::Peek instead of a debugger :)
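For readers who have not used it, a hedged sketch of the Devel::Peek approach: Dump writes a scalar's internals to STDERR, and the FLAGS line of the second dump should list UTF8 while the first should not. The two variables are illustrative only, and the exact dump layout varies between perl versions.

```perl
use strict;
use warnings;
use Devel::Peek qw(Dump);

my $bytes = "\xE7";        # one character, one byte of storage
my $chars = "\xE7";
utf8::upgrade($chars);     # same character, UTF-8 internal storage

# Dump prints to STDERR. Compare the FLAGS and PV lines of the two dumps.
Dump($bytes);
Dump($chars);

# The strings still compare equal despite the different internals.
print "equal\n" if $bytes eq $chars;
```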

    Oh, and here's an example of a non-Unicode string:

    "\x{FFFF_FFFF_FFFF}"
    (Unicode doesn't have such a big "codepoint")
      As Ricardo Signes said: Right now, you can write programs in Perl that handle all this correctly, using only one tool: extreme vigilance.

      That's the source of my depression!

      The "situation" I referred to is the desire of a customer to sort two sets of data together: 1 legacy set stored in ascii/ANSI/ISO-8859-x; and another newer set stored in Unicode. The problem is that the legacy set makes use of the extended ascii character set (8-bit chars) which don't convert to Unicode (easily).

      My take when asked about it was: don't! Keep two lists for lookup and don't mix them, because they cannot logically be sorted together. They countered by sorting two small subsets together (using Java) and saying that it was easier for their people to do lookups in a single list.

      It was at that point I asked my question here. My expectation was that sort would either throw an error; or sort them into two distinct groups, but I didn't know. (Or know how to check without doing a shitload of reading and trial and error.)

      The result of this thread is so depressing that I'm going to turn the work down and let them find someone else. (Shame. Could have been a nice in.)


        newer set stored in Unicode

        This doesn't mean anything. Unicode is the complete standard, not a character set or encoding.

        The problem is that the legacy set makes use of the extended ascii character set (8-bit chars) which don't convert to Unicode (easily).
        Hmm... why not?
        My take when asked about it was: don't! Keep two lists for lookup and don't mix them, because they cannot logically be sorted together. They countered by sorting two small subsets together (using Java) and saying that it was easier for their people to do lookups in a single list.
        Well, that doesn't look too difficult? Why not decode their legacy set (as in map Encode::decode( 'LEGACY_SET', $_ ), @set) and sort that? And if their set happens to be ISO-8859-1 (aka Latin-1), then decoding isn't even necessary (and that's the deal with utf8-off strings in perl; they're assumed to be in THAT encoding, although some people say it just looks like it :)
        The result of this thread is so depressing that I'm going to turn the work down and let them find someone else. (Shame. Could have been a nice in.)
        Shame indeed, because Perl is actually very good for Unicode stuff... Unicode::Collate::Locale, for example... but yeah, Perl's strings are a source of much confusion.
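The decode-then-sort suggestion above could look roughly like this. It is a sketch under stated assumptions: the legacy data is taken to be ISO-8859-2 (the thread never says which ISO-8859-x is involved), the sample records are invented, and the plain Unicode::Collate default ordering stands in for whatever locale the customer actually needs (Unicode::Collate::Locale would be the place to look for that).

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical legacy records: raw ISO-8859-2 bytes (the encoding is an assumption).
my @legacy = ("\xE8aj", "cibule");      # 0xE8 is c-with-caron in ISO-8859-2

# Hypothetical newer records, already decoded into character strings.
my @modern = ("\x{107}ma", "citron");   # U+0107 is c-with-acute

# Decode the legacy bytes so both sets are character strings.
my @decoded = map decode('iso-8859-2', $_), @legacy;

# Collate the combined list with the default Unicode collation rules.
my $collator = Unicode::Collate->new;
my @sorted   = $collator->sort(@decoded, @modern);

print "$_\n" for @sorted;
```

With the default rules the accented letters sort alongside plain "c" rather than after "z", whereas a bare sort would order the same list by code point.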
      No, as far as perl is concerned, you start with 4 Unicode strings and get 8 Unicode strings... in different storage formats. utf8 flag says pretty much nothing about "Unicodeness".
      And now I remembered that LATIN_SMALL_LETTER_C_WITH_CEDILLA is codepoint 231 and you didn't use feature 'unicode_strings' (or, more commonly, use 5.012)... So yeah, you had 3 Unicode strings and 1 non-Unicode, "c" being Unicode (utf8 off), CEDILLA non-Unicode (utf8 on...) Interesting, isn't it? :)
        Adding unicode_strings doesn't change the output in any way.
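A short sketch of what unicode_strings does and does not affect, which may explain why the output stayed the same. Comparison with cmp (and therefore the default sort) goes by code point either way. What the feature changes is character semantics such as case mapping for utf8-off strings in the 128..255 range. The variable names are illustrative.

```perl
use strict;
use warnings;

my $native   = "\xE7";        # U+00E7, utf8 flag off
my $upgraded = "\xE7";
utf8::upgrade($upgraded);     # same character, utf8 flag on

my (@without, @with);
{
    no feature 'unicode_strings';
    # Without the feature, uc() leaves the utf8-off copy alone.
    @without = (uc($native)   eq "\xC7" ? 1 : 0,
                uc($upgraded) eq "\xC7" ? 1 : 0);
}
{
    use feature 'unicode_strings';
    # With the feature, both copies uppercase to U+00C7.
    @with = (uc($native)   eq "\xC7" ? 1 : 0,
             uc($upgraded) eq "\xC7" ? 1 : 0);
}

# Comparison is unaffected in either case.
my $cmp_result = $native cmp $upgraded;

print "without feature: @without\n";
print "with feature:    @with\n";
print "cmp:             $cmp_result\n";
```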
