flexvault has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I came across "What's Wrong with sort and How to Fix It" article by Tom Christiansen dated August 31, 2011 6:00 AM on Perl.com which stated:

If you have code that purports to sort text that looks like this: @sorted_lines = sort @lines; Then all you have to get a dictionary sort is write instead: use Unicode::Collate; @sorted_lines = Unicode::Collate::->new->sort(@lines);
Which looked really cool!

Now I have a script that verifies to me how string comparison operators work:

use strict; use warnings; use Benchmark qw(:all); for my $no ( 0 .. 255 ) { my ( $lt, $eq, $gt ); if ( $no == 0 ) { $lt = chr($no); if ( ! ( "" lt $lt ) ) { die "\"\" not lt \\0\n"; } if ( "" eq $lt ) { die "\"\" eq \\0\n"; } if ( "" gt $lt ) { die "\"\" gt \\0\n"; } } elsif ( $no == 255 ) { $lt = chr($no-1); $eq = chr($no); if ( ! ( $lt lt $eq ) ) { die "$no not lt \n"; } if ( chr($no) ne $eq ) { die "$no not eq \n"; } } else { $lt = chr($no-1); $eq = chr($no); $gt = chr($no+1); if ( ! ( $lt lt $eq ) ) { die "$no not lt \n"; } if ( chr($no) ne $eq ) { die "$no not eq \n"; } if ( ! ( $gt gt $eq ) ) { die "$gt not gt \n"; } } } print "Great!!!\n"; exit;
When you run the code it does this:
> time perl test_ltgteq.plx Great!!! real 0m0.046s user 0m0.036s sys 0m0.008s

So in order to see how "Unicode::Collate" would work I modified the code to the following:

use strict; use warnings; use Unicode::Collate; my $Collator = Unicode::Collate->new(); my $Print = ""; my $something = ""; ## Initialize if ( $Collator->eq( $something, "" ) ) { $Print .= "\$somethin +g eq \"\"\n"; } else { $Print .= "\$somethi +ng ne \"\"\n"; } for my $no ( 0 .. 255 ) { my ( $lt, $eq, $gt ); if ( $no == 0 ) { $lt = chr($no); if ( ! ( $Collator->lt( "", $lt ) ) ) { $Print .= "0: \"\" + not lt \\0\n"; } if ( $Collator->eq( "", $lt ) ) { $Print .= "0: \"\" e +q \\0\n"; } if ( $Collator->gt( "", $lt ) ) { $Print .= "0: \"\" g +t \\0\n"; } } elsif ( $no == 255 ) { $lt = chr($no-1); $eq = chr($no); if ( ! ( $Collator->lt( $lt, $eq ) ) ) { $Print .= "255 no +t lt $eq\n"; } if ( $Collator->ne( chr($no), $eq ) ) { $Print .= "255 not + eq $eq\n"; } } else { $lt = chr($no-1); $eq = chr($no); $gt = chr($no+1); if ( ! ( $Collator->lt( $lt, $eq ) ) ) { $Print .= ord($lt +)." not lt ".ord($eq)."\n"; } if ( $Collator->ne(chr($no), $eq ) ) { $Print .= "$no not +eq ".ord($eq)."\n"; } if ( ! ( $Collator->gt( $gt, $eq ) ) ) { $Print .= ord($gt +)." not gt ".ord($eq)."\n"; } } } $Print .= "\nGreat!!!\n\n"; print $Print; exit;
And when you execute it, you get the following:
> time perl test_ltgteq2.plx $something eq "" 0: "" not lt \0 0: "" eq \0 0 not lt 1 2 not gt 1 1 not lt 2 3 not gt 2 2 not lt 3 4 not gt 3 3 not lt 4 5 not gt 4 4 not lt 5 6 not gt 5 5 not lt 6 7 not gt 6 6 not lt 7 8 not gt 7 7 not lt 8 14 not gt 13 13 not lt 14 15 not gt 14 14 not lt 15 16 not gt 15 15 not lt 16 17 not gt 16 16 not lt 17 18 not gt 17 17 not lt 18 19 not gt 18 18 not lt 19 20 not gt 19 19 not lt 20 21 not gt 20 20 not lt 21 22 not gt 21 21 not lt 22 23 not gt 22 22 not lt 23 24 not gt 23 23 not lt 24 25 not gt 24 24 not lt 25 26 not gt 25 25 not lt 26 27 not gt 26 26 not lt 27 28 not gt 27 27 not lt 28 29 not gt 28 28 not lt 29 30 not gt 29 29 not lt 30 31 not gt 30 30 not lt 31 37 not gt 36 36 not lt 37 38 not gt 37 37 not lt 38 39 not gt 38 38 not lt 39 44 not gt 43 43 not lt 44 45 not gt 44 44 not lt 45 58 not gt 57 57 not lt 58 59 not gt 58 58 not lt 59 63 not gt 62 62 not lt 63 91 not gt 90 90 not lt 91 93 not gt 92 92 not lt 93 94 not gt 93 93 not lt 94 96 not gt 95 95 not lt 96 123 not gt 122 122 not lt 123 125 not gt 124 124 not lt 125 127 not gt 126 126 not lt 127 128 not gt 127 127 not lt 128 129 not gt 128 128 not lt 129 130 not gt 129 129 not lt 130 131 not gt 130 130 not lt 131 132 not gt 131 131 not lt 132 134 not gt 133 133 not lt 134 135 not gt 134 134 not lt 135 136 not gt 135 135 not lt 136 137 not gt 136 136 not lt 137 138 not gt 137 137 not lt 138 139 not gt 138 138 not lt 139 140 not gt 139 139 not lt 140 141 not gt 140 140 not lt 141 142 not gt 141 141 not lt 142 143 not gt 142 142 not lt 143 144 not gt 143 143 not lt 144 145 not gt 144 144 not lt 145 146 not gt 145 145 not lt 146 147 not gt 146 146 not lt 147 148 not gt 147 147 not lt 148 149 not gt 148 148 not lt 149 150 not gt 149 149 not lt 150 151 not gt 150 150 not lt 151 152 not gt 151 151 not lt 152 153 not gt 152 152 not lt 153 154 not gt 153 153 not lt 154 155 not gt 154 154 not lt 155 156 not gt 155 155 not lt 156 157 not gt 156 156 not lt 157 158 not gt 157 157 not lt 158 159 not gt 158 158 not lt 159 164 not gt 163 163 not lt 164 166 not gt 165 165 not lt 166 167 not gt 166 166 not lt 167 168 not gt 167 167 not lt 168 171 not gt 170 170 not lt 171 173 not gt 172 172 not lt 173 175 not gt 174 174 not lt 175 180 not gt 179 179 not lt 180 182 not gt 181 181 not lt 182 183 not gt 182 182 not lt 183 184 not gt 183 183 not lt 184 187 not gt 186 186 not lt 187 189 not gt 188 188 not lt 189 191 not gt 190 190 not lt 191 193 not gt 192 192 not lt 193 196 not gt 195 195 not lt 196 197 not gt 196 196 not lt 197 201 not gt 200 200 not lt 201 205 not gt 204 204 not lt 205 208 not gt 207 207 not lt 208 211 not gt 210 210 not lt 211 214 not gt 213 213 not lt 214 215 not gt 214 214 not lt 215 218 not gt 217 217 not lt 218 223 not gt 222 222 not lt 223 224 not gt 223 223 not lt 224 225 not gt 224 224 not lt 225 228 not gt 227 227 not lt 228 229 not gt 228 228 not lt 229 233 not gt 232 232 not lt 233 237 not gt 236 236 not lt 237 240 not gt 239 239 not lt 240 243 not gt 242 242 not lt 243 246 not gt 245 245 not lt 246 247 not gt 246 246 not lt 247 250 not gt 249 249 not lt 250 255 not gt 254 255 not lt ÿ Great!!! real 0m0.686s user 0m0.672s sys 0m0.008s

I added the "$something" to verify what I was seeing. I also changed "die" to "print" since there were so many differences.

What did I expect? Not what I got!

What I had expected was to see changes to the sort order after chr(127) and not before. Should I have added a parameter to "my $Collator = Unicode::Collate->new();" to get the sort to act like the Perl sort, or is this the way it is?

Thank you for your comments.

"Well done is better than well said." - Benjamin Franklin

Replies are listed 'Best First'.
Re: RFC: Is this the correct use of Unicode::Collate?
by tchrist (Pilgrim) on Jan 17, 2012 at 14:45 UTC
    It’s quite unclear what it is that you really want here. It doesn’t make much sense to run a text sort on nontextual data, which is what control characters are. Why should invisible characters make any difference to how text sorts? They shouldn’t. The whole point of the UCA is that it does not behave like a text-ignorant code-point sort; it’s a text sort, which is a completely different beastie. What you’ve showed seems completely correct to me, so I don’t understand what you want.

    I suppose that it is perhaps possible that you may wish to change the weighting of the variably-weighted elements such that they are not ignored the first time around. The constructor’s variable named-argument accepts four possible values to alter how otherwise-ignored code points are to be treated. Per the manpage:

    blanked
    Variable elements are made ignorable at levels 1 through 3; considered at the 4ᵗʰ level.
    non-ignorable
    Variable elements are not reset to ignorable.
    shifted
    Variable elements are made ignorable at levels 1 through 3 their level 4 weight is replaced by the old level 1 weight. Level 4 weight for Non-Variable elements is 0xFFFF.
    shift-trimmed
    Same as shifted, but all FFFF’s at the 4ᵗʰ level are trimmed.
    So perhaps variable => "non-ignorable" might be more to your liking, or one of the other three possible settings.

    You may also want to set upper_before_lower => 1.

      tchrist,

      A "common" practice for handling duplicate names in a database is to append non-printable characters after the name, in the order of insertion. This is like using base 32 (numbers 0 to 31 ) for appended characters. This allows duplicates and retains the order of insertion. You don't have a limit since when you fill the first character, you just add another as "\0" and continue from there. That would be broken with Unicode::Collate.

      The implication in the article was that you could replace 'sort' with 'Unicode::Collate'.

      Thank you

      "Well done is better than well said." - Benjamin Franklin

        The implication in the article was that you could replace 'sort' with 'Unicode::Collate'.

        And that seems to be the real problem. sort isn't broken (that's just link baiting), and neither is Unicode::Collate. They just do different things.

        The article does say

        Fortunately, you don't have to come up with your own algorithm for dictionary sorting, because Perl provides a standard class to do this for you: Unicode::Collate

        So despite its title, it doesn't mandate UC to be a universal replacement for sort, but just for one application.

        A "common" practice for handling duplicate names in a database is to append non-printable characters after the name, in the order of insertion. This is like using base 32 (numbers 0 to 31 ) for appended characters. This allows duplicates and retains the order of insertion. You don't have a limit since when you fill the first character, you just add another as "\0" and continue from there. That would be broken with Unicode::Collate.

        The implication in the article was that you could replace 'sort' with 'Unicode::Collate'.

        I’m afraid you’ve swapped my implication with your inference, as I implied no such thing — and what you’ve inferred in no way follows from what I wrote. Quoting myself, I wrote:
        If you have code that purports to sort text that looks like this:
        @sorted_lines = sort @lines;
        Then all you have to get a dictionary sort is write instead:
        use Unicode::Collate; @sorted_lines = Unicode::Collate::->new->sort(@lines);
        See the red part? Clearly, you do not have ‘code that purports to sort text’! Therefore, nothing I wrote applies to you.

        You have code that blindly does a mindless numeric sort on code points, not an alphabetic sort on text. What you are doing is not an alphabetic sort. Plus sorting of textual representations of numbers is specifically outside the scope of the UCA.

        Of course it’s trivial to modify the UCA sort to take care of your weirdo situation, such that it does a proper text sort on the text and a weirdo binary sort on the binary. But you have to tell it to do that. It doesn’t play mind games with you; here as always, one has to know what one is doing, and why.

        A "common" practice for handling duplicate names in a database is to append non-printable characters after the name, in the order of insertion.

        What you need is an invisible letter in Unicode. Just such a letter was proposed several years ago by typographer Michael Everson. His proposed name for the character was INVISIBLE LETTER. Unfortunately, the Unicode Consortium rejected his proposal. See Proposal to add INVISIBLE LETTER to the UCS and Every character has a story #11: U+???? (The Invisible Letter)

        If there were such an invisible Unicode character, you could do something like this:

        #!perl use strict; use warnings; use open qw( :std :encoding(UTF-8) ); use charnames qw( :full ); use Unicode::Collate; my $DISAMBIGUATOR_CHARACTER = "\N{LATIN SMALL LIGATURE FFL}"; # U+FB04 my %president_number_by; # President number by president name my %seen; while (<DATA>) { chomp; my ($name, $number) = split m/,/, $_, 2; $seen{$name} = exists $seen{$name} ? $seen{$name} . $DISAMBIGUATOR_CHARACTER : $name ; $president_number_by{$seen{$name}} = $number; } my $collator = Unicode::Collate->new(); for my $name ($collator->sort(keys %president_number_by)) { my $number = $president_number_by{$name}; $name =~ s/$DISAMBIGUATOR_CHARACTER+$//; print "$name,$number\n"; } exit 0; __DATA__ Washington,1 Adams,2 Jefferson,3 Madison,4 Monroe,5 Adams,6 Jackson,7 Van Buren,8 Harrison,9 Tyler,10 Polk,11 Taylor,12 Fillmore,13 Pierce,14 Buchanan,15 Lincoln,16 Johnson,17 Simpson,18 Hayes,19 Garfield,20 Arthur,21 Cleveland,22 Harrison,23 Cleveland,24 McKinley,25 Roosevelt,26 Taft,27 Wilson,28 Harding,29 Coolidge,30 Hoover,31 Roosevelt,32 Truman,33 Eisenhower,34 Kennedy,35 Johnson,36 Nixon,37 Ford,38 Carter,39 Reagan,40 Bush,41 Clinton,42 Bush,43 Obama,44 Bush,45

        This script produces this output:

        Adams,2 Adams,6 Arthur,21 Buchanan,15 Bush,41 Bush,43 Bush,45 Carter,39 Cleveland,22 Cleveland,24 Clinton,42 Coolidge,30 Eisenhower,34 Fillmore,13 Ford,38 Garfield,20 Harding,29 Harrison,9 Harrison,23 Hayes,19 Hoover,31 Jackson,7 Jefferson,3 Johnson,17 Johnson,36 Kennedy,35 Lincoln,16 Madison,4 McKinley,25 Monroe,5 Nixon,37 Obama,44 Pierce,14 Polk,11 Reagan,40 Roosevelt,26 Roosevelt,32 Simpson,18 Taft,27 Taylor,12 Truman,33 Tyler,10 Van Buren,8 Washington,1 Wilson,28

        (For the purpose of demonstrating more than two presidents with the same last name, I had to assume Barack Obama is re-elected in 2012 and Jeb Bush is elected in 2016. I'm sorry if this prospect offends you.)

        This is a pure Unicode solution to the problem. There's no commingling of Unicode characters or graphemes with binary data. Unfortunately, however, there isn't a Unicode character with the general property L (Letter) that's guaranteed to be invisible. If there were, it would be just the right character to use for this "weirdo" purpose.

        Why did I use the Unicode character LATIN SMALL LIGATURE FFL in the demo script? I don't know exactly. Maybe because it's a character that collates high and seems impossibly unlikely ever to occur in real data.

        Jim