RFC: Is this the correct use of Unicode::Collate?

flexvault has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I came across "What's Wrong with sort and How to Fix It" article by Tom Christiansen dated August 31, 2011 6:00 AM on Perl.com which stated:

If you have code that purports to sort text that looks like this:

@sorted_lines = sort @lines;

Then all you have to get a dictionary sort is write instead:

use Unicode::Collate;
@sorted_lines = Unicode::Collate::->new->sort(@lines);
[download]

Which looked really cool!

Now I have a script that verifies to me how string comparison operators work:

use strict;
use warnings;
use Benchmark qw(:all);

for my $no ( 0 .. 255 )
{    my ( $lt, $eq, $gt );
    if ( $no == 0 )
    {    $lt = chr($no);
        if ( ! ( "" lt $lt ) ) { die "\"\" not lt \\0\n"; }
        if ( "" eq $lt ) { die "\"\" eq \\0\n"; }
        if ( "" gt $lt ) { die "\"\" gt \\0\n"; }
    }
    elsif ( $no == 255 )
    {    $lt = chr($no-1); $eq = chr($no);
        if ( ! ( $lt lt $eq ) ) { die "$no not lt \n"; }
        if ( chr($no) ne $eq ) { die "$no not eq \n"; }
    }
    else
    {    $lt = chr($no-1); $eq = chr($no); $gt = chr($no+1);
        if ( ! ( $lt lt $eq ) ) { die "$no not lt \n"; }
        if ( chr($no) ne $eq ) { die "$no not eq \n"; }
        if ( ! ( $gt gt $eq ) ) { die "$gt not gt \n"; }
    }

}
print "Great!!!\n";
exit;
[download]

When you run the code it does this:

> time perl test_ltgteq.plx
Great!!!

real    0m0.046s
user    0m0.036s
sys     0m0.008s
[download]

So in order to see how "Unicode::Collate" would work I modified the code to the following:

use strict;
use warnings;
use Unicode::Collate;

my $Collator = Unicode::Collate->new(); my $Print = "";

my $something = "";        ## Initialize
if ( $Collator->eq( $something, "" ) )         { $Print .= "\$somethin
+g eq \"\"\n"; }
else                                            { $Print .= "\$somethi
+ng ne \"\"\n"; }


for my $no ( 0 .. 255 )
{    my ( $lt, $eq, $gt );
    if ( $no == 0 )
    {    $lt = chr($no);
        if ( ! ( $Collator->lt( "", $lt ) ) )     { $Print .= "0: \"\"
+ not lt \\0\n"; }
        if ( $Collator->eq( "", $lt ) )         { $Print .= "0: \"\" e
+q \\0\n"; }
        if ( $Collator->gt( "", $lt ) )         { $Print .= "0: \"\" g
+t \\0\n"; }
    }
    elsif ( $no == 255 )
    {    $lt = chr($no-1); $eq = chr($no);
        if ( ! ( $Collator->lt( $lt, $eq ) ) )     { $Print .= "255 no
+t lt $eq\n"; }
        if ( $Collator->ne( chr($no), $eq ) )     { $Print .= "255 not
+ eq $eq\n"; }
    }
    else
    {    $lt = chr($no-1); $eq = chr($no); $gt = chr($no+1);
        if ( ! ( $Collator->lt( $lt, $eq ) ) )     { $Print .= ord($lt
+)." not lt ".ord($eq)."\n"; }
        if ( $Collator->ne(chr($no), $eq ) )     { $Print .= "$no not 
+eq ".ord($eq)."\n"; }
        if ( ! ( $Collator->gt( $gt, $eq ) ) )     { $Print .= ord($gt
+)." not gt ".ord($eq)."\n"; }
    }

}
$Print .= "\nGreat!!!\n\n";
print $Print;
exit;
[download]

And when you execute it, you get the following:

> time perl test_ltgteq2.plx
$something eq ""
0: "" not lt \0
0: "" eq \0
0 not lt 1
2 not gt 1
1 not lt 2
3 not gt 2
2 not lt 3
4 not gt 3
3 not lt 4
5 not gt 4
4 not lt 5
6 not gt 5
5 not lt 6
7 not gt 6
6 not lt 7
8 not gt 7
7 not lt 8
14 not gt 13
13 not lt 14
15 not gt 14
14 not lt 15
16 not gt 15
15 not lt 16
17 not gt 16
16 not lt 17
18 not gt 17
17 not lt 18
19 not gt 18
18 not lt 19
20 not gt 19
19 not lt 20
21 not gt 20
20 not lt 21
22 not gt 21
21 not lt 22
23 not gt 22
22 not lt 23
24 not gt 23
23 not lt 24
25 not gt 24
24 not lt 25
26 not gt 25
25 not lt 26
27 not gt 26
26 not lt 27
28 not gt 27
27 not lt 28
29 not gt 28
28 not lt 29
30 not gt 29
29 not lt 30
31 not gt 30
30 not lt 31
37 not gt 36
36 not lt 37
38 not gt 37
37 not lt 38
39 not gt 38
38 not lt 39
44 not gt 43
43 not lt 44
45 not gt 44
44 not lt 45
58 not gt 57
57 not lt 58
59 not gt 58
58 not lt 59
63 not gt 62
62 not lt 63
91 not gt 90
90 not lt 91
93 not gt 92
92 not lt 93
94 not gt 93
93 not lt 94
96 not gt 95
95 not lt 96
123 not gt 122
122 not lt 123
125 not gt 124
124 not lt 125
127 not gt 126
126 not lt 127
128 not gt 127
127 not lt 128
129 not gt 128
128 not lt 129
130 not gt 129
129 not lt 130
131 not gt 130
130 not lt 131
132 not gt 131
131 not lt 132
134 not gt 133
133 not lt 134
135 not gt 134
134 not lt 135
136 not gt 135
135 not lt 136
137 not gt 136
136 not lt 137
138 not gt 137
137 not lt 138
139 not gt 138
138 not lt 139
140 not gt 139
139 not lt 140
141 not gt 140
140 not lt 141
142 not gt 141
141 not lt 142
143 not gt 142
142 not lt 143
144 not gt 143
143 not lt 144
145 not gt 144
144 not lt 145
146 not gt 145
145 not lt 146
147 not gt 146
146 not lt 147
148 not gt 147
147 not lt 148
149 not gt 148
148 not lt 149
150 not gt 149
149 not lt 150
151 not gt 150
150 not lt 151
152 not gt 151
151 not lt 152
153 not gt 152
152 not lt 153
154 not gt 153
153 not lt 154
155 not gt 154
154 not lt 155
156 not gt 155
155 not lt 156
157 not gt 156
156 not lt 157
158 not gt 157
157 not lt 158
159 not gt 158
158 not lt 159
164 not gt 163
163 not lt 164
166 not gt 165
165 not lt 166
167 not gt 166
166 not lt 167
168 not gt 167
167 not lt 168
171 not gt 170
170 not lt 171
173 not gt 172
172 not lt 173
175 not gt 174
174 not lt 175
180 not gt 179
179 not lt 180
182 not gt 181
181 not lt 182
183 not gt 182
182 not lt 183
184 not gt 183
183 not lt 184
187 not gt 186
186 not lt 187
189 not gt 188
188 not lt 189
191 not gt 190
190 not lt 191
193 not gt 192
192 not lt 193
196 not gt 195
195 not lt 196
197 not gt 196
196 not lt 197
201 not gt 200
200 not lt 201
205 not gt 204
204 not lt 205
208 not gt 207
207 not lt 208
211 not gt 210
210 not lt 211
214 not gt 213
213 not lt 214
215 not gt 214
214 not lt 215
218 not gt 217
217 not lt 218
223 not gt 222
222 not lt 223
224 not gt 223
223 not lt 224
225 not gt 224
224 not lt 225
228 not gt 227
227 not lt 228
229 not gt 228
228 not lt 229
233 not gt 232
232 not lt 233
237 not gt 236
236 not lt 237
240 not gt 239
239 not lt 240
243 not gt 242
242 not lt 243
246 not gt 245
245 not lt 246
247 not gt 246
246 not lt 247
250 not gt 249
249 not lt 250
255 not gt 254
255 not lt ÿ

Great!!!


real    0m0.686s
user    0m0.672s
sys     0m0.008s
[download]

I added the "$something" to verify what I was seeing. I also changed "die" to "print" since there were so many differences.

What did I expect? Not what I got!

What I had expected was to see changes to the sort order after chr(127) and not before. Should I have added a parameter to "my $Collator = Unicode::Collate->new();" to get the sort to act like the Perl sort, or is this the way it is?

Thank you for your comments.

"Well done is better than well said." - Benjamin Franklin

Comment on RFC: Is this the correct use of Unicode::Collate? Select or Download Code

Replies are listed 'Best First'.
Re: RFC: Is this the correct use of Unicode::Collate? by tchrist (Pilgrim) on Jan 17, 2012 at 14:45 UTC
It’s quite unclear what it is that you really want here. It doesn’t make much sense to run a text sort on nontextual data, which is what control characters are. Why should invisible characters make any difference to how text sorts? They shouldn’t. The whole point of the UCA is that it does not behave like a text-ignorant code-point sort; it’s a text sort, which is a completely different beastie. What you’ve showed seems completely correct to me, so I don’t understand what you want. I suppose that it is perhaps possible that you may wish to change the weighting of the variably-weighted elements such that they are not ignored the first time around. The constructor’s `variable` named-argument accepts four possible values to alter how otherwise-ignored code points are to be treated. Per the manpage: `blanked` Variable elements are made ignorable at levels 1 through 3; considered at the 4ᵗʰ level. `non-ignorable` Variable elements are not reset to ignorable. `shifted` Variable elements are made ignorable at levels 1 through 3 their level 4 weight is replaced by the old level 1 weight. Level 4 weight for Non-Variable elements is 0xFFFF. `shift-trimmed` Same as `shifted`, but all FFFF’s at the 4ᵗʰ level are trimmed. So perhaps `variable => "non-ignorable"` might be more to your liking, or one of the other three possible settings. You may also want to set `upper_before_lower => 1`.	[reply]
Re^2: RFC: Is this the correct use of Unicode::Collate? by flexvault (Monsignor) on Jan 17, 2012 at 15:31 UTC
tchrist, A "common" practice for handling duplicate names in a database is to append non-printable characters after the name, in the order of insertion. This is like using base 32 (numbers 0 to 31 ) for appended characters. This allows duplicates and retains the order of insertion. You don't have a limit since when you fill the first character, you just add another as "\0" and continue from there. That would be broken with Unicode::Collate. The implication in the article was that you could replace 'sort' with 'Unicode::Collate'. Thank you "Well done is better than well said." - Benjamin Franklin	[reply]
Re^3: RFC: Is this the correct use of Unicode::Collate? by moritz (Cardinal) on Jan 17, 2012 at 15:39 UTC
The implication in the article was that you could replace 'sort' with 'Unicode::Collate'. And that seems to be the real problem. sort isn't broken (that's just link baiting), and neither is Unicode::Collate. They just do different things. The article does say Fortunately, you don't have to come up with your own algorithm for dictionary sorting, because Perl provides a standard class to do this for you: Unicode::Collate So despite its title, it doesn't mandate UC to be a universal replacement for sort, but just for one application. Perl 6 - second systems done right	[reply]
Re^4: RFC: Is this the correct use of Unicode::Collate? by flexvault (Monsignor) on Jan 17, 2012 at 16:10 UTC
Re^5: RFC: Is this the correct use of Unicode::Collate? by tchrist (Pilgrim) on Jan 17, 2012 at 18:53 UTC
Some notes below your chosen depth have not been shown here
Re^3: RFC: Is this the correct use of Unicode::Collate? by tchrist (Pilgrim) on Jan 17, 2012 at 16:19 UTC
A "common" practice for handling duplicate names in a database is to append non-printable characters after the name, in the order of insertion. This is like using base 32 (numbers 0 to 31 ) for appended characters. This allows duplicates and retains the order of insertion. You don't have a limit since when you fill the first character, you just add another as "\0" and continue from there. That would be broken with Unicode::Collate. The implication in the article was that you could replace 'sort' with 'Unicode::Collate'. I’m afraid you’ve swapped my implication with your inference, as I implied no such thing — and what you’ve inferred in no way follows from what I wrote. Quoting myself, I wrote: If you have code that purports to sort text that looks like this: `@sorted_lines = sort @lines;` Then all you have to get a dictionary sort is write instead: `use Unicode::Collate; @sorted_lines = Unicode::Collate::->new->sort(@lines);` [download] See the red part? Clearly, you do not have ‘code that purports to sort text’! Therefore, nothing I wrote applies to you. You have code that blindly does a mindless numeric sort on code points, not an alphabetic sort on text. What you are doing is not an alphabetic sort. Plus sorting of textual representations of numbers is specifically outside the scope of the UCA. Of course it’s trivial to modify the UCA sort to take care of your weirdo situation, such that it does a proper text sort on the text and a weirdo binary sort on the binary. But you have to tell it to do that. It doesn’t play mind games with you; here as always, one has to know what one is doing, and why.	[reply] [d/l] [select]
Re^4: RFC: Is this the correct use of Unicode::Collate? by flexvault (Monsignor) on Jan 17, 2012 at 19:05 UTC
Re^3: RFC: Is this the correct use of Unicode::Collate? by Jim (Curate) on Jun 24, 2012 at 02:13 UTC
A "common" practice for handling duplicate names in a database is to append non-printable characters after the name, in the order of insertion. What you need is an invisible letter in Unicode. Just such a letter was proposed several years ago by typographer Michael Everson. His proposed name for the character was INVISIBLE LETTER. Unfortunately, the Unicode Consortium rejected his proposal. See Proposal to add INVISIBLE LETTER to the UCS and Every character has a story #11: U+???? (The Invisible Letter) If there were such an invisible Unicode character, you could do something like this: #!perl use strict; use warnings; use open qw( :std :encoding(UTF-8) ); use charnames qw( :full ); use Unicode::Collate; my $DISAMBIGUATOR_CHARACTER = "\N{LATIN SMALL LIGATURE FFL}"; # U+FB04 my %president_number_by; # President number by president name my %seen; while (<DATA>) { chomp; my ($name, $number) = split m/,/, $_, 2; $seen{$name} = exists $seen{$name} ? $seen{$name} . $DISAMBIGUATOR_CHARACTER : $name ; $president_number_by{$seen{$name}} = $number; } my $collator = Unicode::Collate->new(); for my $name ($collator->sort(keys %president_number_by)) { my $number = $president_number_by{$name}; $name =~ s/$DISAMBIGUATOR_CHARACTER+$//; print "$name,$number\n"; } exit 0; __DATA__ Washington,1 Adams,2 Jefferson,3 Madison,4 Monroe,5 Adams,6 Jackson,7 Van Buren,8 Harrison,9 Tyler,10 Polk,11 Taylor,12 Fillmore,13 Pierce,14 Buchanan,15 Lincoln,16 Johnson,17 Simpson,18 Hayes,19 Garfield,20 Arthur,21 Cleveland,22 Harrison,23 Cleveland,24 McKinley,25 Roosevelt,26 Taft,27 Wilson,28 Harding,29 Coolidge,30 Hoover,31 Roosevelt,32 Truman,33 Eisenhower,34 Kennedy,35 Johnson,36 Nixon,37 Ford,38 Carter,39 Reagan,40 Bush,41 Clinton,42 Bush,43 Obama,44 Bush,45 [download] This script produces this output: `Adams,2 Adams,6 Arthur,21 Buchanan,15 Bush,41 Bush,43 Bush,45 Carter,39 Cleveland,22 Cleveland,24 Clinton,42 Coolidge,30 Eisenhower,34 Fillmore,13 Ford,38 Garfield,20 Harding,29 Harrison,9 Harrison,23 Hayes,19 Hoover,31 Jackson,7 Jefferson,3 Johnson,17 Johnson,36 Kennedy,35 Lincoln,16 Madison,4 McKinley,25 Monroe,5 Nixon,37 Obama,44 Pierce,14 Polk,11 Reagan,40 Roosevelt,26 Roosevelt,32 Simpson,18 Taft,27 Taylor,12 Truman,33 Tyler,10 Van Buren,8 Washington,1 Wilson,28` [download] (For the purpose of demonstrating more than two presidents with the same last name, I had to assume Barack Obama is re-elected in 2012 and Jeb Bush is elected in 2016. I'm sorry if this prospect offends you.) This is a pure Unicode solution to the problem. There's no commingling of Unicode characters or graphemes with binary data. Unfortunately, however, there isn't a Unicode character with the general property L (Letter) that's guaranteed to be invisible. If there were, it would be just the right character to use for this "weirdo" purpose. Why did I use the Unicode character LATIN SMALL LIGATURE FFL in the demo script? I don't know exactly. Maybe because it's a character that collates high and seems impossibly unlikely ever to occur in real data. Jim	[reply] [d/l] [select]
Re^4: RFC: Is this the correct use of Unicode::Collate? by flexvault (Monsignor) on Jun 24, 2012 at 10:35 UTC
Re^5: RFC: Is this the correct use of Unicode::Collate? by Anonymous Monk on Jun 24, 2012 at 11:29 UTC
Some notes below your chosen depth have not been shown here
Re^5: RFC: Is this the correct use of Unicode::Collate? by Jim (Curate) on Jun 24, 2012 at 18:08 UTC