in reply to Re: RFC: Is this the correct use of Unicode::Collate?
in thread RFC: Is this the correct use of Unicode::Collate?

tchrist,

A "common" practice for handling duplicate names in a database is to append non-printable characters after the name, in the order of insertion. This is like using base 32 (numbers 0 to 31 ) for appended characters. This allows duplicates and retains the order of insertion. You don't have a limit since when you fill the first character, you just add another as "\0" and continue from there. That would be broken with Unicode::Collate.

The implication in the article was that you could replace 'sort' with 'Unicode::Collate'.

Thank you

"Well done is better than well said." - Benjamin Franklin


Replies are listed 'Best First'.
Re^3: RFC: Is this the correct use of Unicode::Collate?
by moritz (Cardinal) on Jan 17, 2012 at 15:39 UTC
    The implication in the article was that you could replace 'sort' with 'Unicode::Collate'.

    And that seems to be the real problem. sort isn't broken (that's just link baiting), and neither is Unicode::Collate. They just do different things.

    The article does say

    Fortunately, you don't have to come up with your own algorithm for dictionary sorting, because Perl provides a standard class to do this for you: Unicode::Collate

    So despite its title, it doesn't mandate UC to be a universal replacement for sort, but just for one application.

      moritz,

      But all the references in the article relate to data in databases. I googled ASCII and UTF-8 and repeatedly found statements like "...UTF-8 uses one byte for any ASCII character, which has the same code value in both UTF-8 and ASCII encoding...", so why are characters 0 - 127 being redefined? I understand the complexity of the subject, but the designers of UTF-8 knew better than to mess with ASCII, and that is why UTF-8 is an extension of ASCII.

      'Unicode::Collate' is core, so it could be used a lot in the future, as it should be. But a lot of production environments will be affected if they don't know in advance that the code points of ASCII have been redefined.

      My hope was that someone would say that something like an 'ASCII => 1' option would make it behave like Perl's 'sort' for ASCII characters, and apply the Unicode rules to anything above 127.

      Thank you

      "Well done is better than well said." - Benjamin Franklin

        'Unicode::Collate' is core, so it could be used a lot in the future, as it should be. But a lot of production environments will be affected if they don't know in advance that the code points of ASCII have been redefined.
        A text sort looks nothing at all like a code point sort. You seem to think that 7-bit code points should not sort as text. That completely defeats the whole purpose.

        Watch here to see what really happens:

        $ perl -MUnicode::Collate -E 'for (Unicode::Collate->new->sort(map { chr } 0..127)) { say "chr ", ord, "\t", /\p{graph}/ ? $_ : "(unprintable)" }'
        chr 0    (unprintable)
        chr 1    (unprintable)
        chr 2    (unprintable)
        chr 3    (unprintable)
        chr 4    (unprintable)
        chr 5    (unprintable)
        chr 6    (unprintable)
        chr 7    (unprintable)
        chr 8    (unprintable)
        chr 14   (unprintable)
        chr 15   (unprintable)
        chr 16   (unprintable)
        chr 17   (unprintable)
        chr 18   (unprintable)
        chr 19   (unprintable)
        chr 20   (unprintable)
        chr 21   (unprintable)
        chr 22   (unprintable)
        chr 23   (unprintable)
        chr 24   (unprintable)
        chr 25   (unprintable)
        chr 26   (unprintable)
        chr 27   (unprintable)
        chr 28   (unprintable)
        chr 29   (unprintable)
        chr 30   (unprintable)
        chr 31   (unprintable)
        chr 127  (unprintable)
        chr 9    (unprintable)
        chr 10   (unprintable)
        chr 11   (unprintable)
        chr 12   (unprintable)
        chr 13   (unprintable)
        chr 32   (unprintable)
        chr 96   `
        chr 94   ^
        chr 95   _
        chr 45   -
        chr 44   ,
        chr 59   ;
        chr 58   :
        chr 33   !
        chr 63   ?
        chr 46   .
        chr 39   '
        chr 34   "
        chr 40   (
        chr 41   )
        chr 91   [
        chr 93   ]
        chr 123  {
        chr 125  }
        chr 64   @
        chr 42   *
        chr 47   /
        chr 92   \
        chr 38   &
        chr 35   #
        chr 37   %
        chr 43   +
        chr 60   <
        chr 61   =
        chr 62   >
        chr 124  |
        chr 126  ~
        chr 36   $
        chr 48   0
        chr 49   1
        chr 50   2
        chr 51   3
        chr 52   4
        chr 53   5
        chr 54   6
        chr 55   7
        chr 56   8
        chr 57   9
        chr 97   a
        chr 65   A
        chr 98   b
        chr 66   B
        chr 99   c
        chr 67   C
        chr 100  d
        chr 68   D
        chr 101  e
        chr 69   E
        chr 102  f
        chr 70   F
        chr 103  g
        chr 71   G
        chr 104  h
        chr 72   H
        chr 105  i
        chr 73   I
        chr 106  j
        chr 74   J
        chr 107  k
        chr 75   K
        chr 108  l
        chr 76   L
        chr 109  m
        chr 77   M
        chr 110  n
        chr 78   N
        chr 111  o
        chr 79   O
        chr 112  p
        chr 80   P
        chr 113  q
        chr 81   Q
        chr 114  r
        chr 82   R
        chr 115  s
        chr 83   S
        chr 116  t
        chr 84   T
        chr 117  u
        chr 85   U
        chr 118  v
        chr 86   V
        chr 119  w
        chr 87   W
        chr 120  x
        chr 88   X
        chr 121  y
        chr 89   Y
        chr 122  z
        chr 90   Z
        See? A text sort looks nothing whatsoever like a code point sort. If you expect the UCA to do a code-point sort on 7-bit code points but a text sort on everything else, I fear that you have gravely misunderstood its purpose and consequences.

        So what may I do to help you understand this better? I would seriously like to know.

Re^3: RFC: Is this the correct use of Unicode::Collate?
by tchrist (Pilgrim) on Jan 17, 2012 at 16:19 UTC
    A "common" practice for handling duplicate names in a database is to append non-printable characters after the name, in the order of insertion. This is like using base 32 (numbers 0 to 31 ) for appended characters. This allows duplicates and retains the order of insertion. You don't have a limit since when you fill the first character, you just add another as "\0" and continue from there. That would be broken with Unicode::Collate.

    The implication in the article was that you could replace 'sort' with 'Unicode::Collate'.

    I’m afraid you’ve swapped my implication with your inference, as I implied no such thing — and what you’ve inferred in no way follows from what I wrote. Quoting myself, I wrote:
    If you have code that purports to sort text that looks like this:
    @sorted_lines = sort @lines;
    Then all you have to get a dictionary sort is write instead:
    use Unicode::Collate;
    @sorted_lines = Unicode::Collate::->new->sort(@lines);
    See the red part? Clearly, you do not have ‘code that purports to sort text’! Therefore, nothing I wrote applies to you.

    You have code that blindly does a mindless numeric sort on code points, not an alphabetic sort on text. What you are doing is not an alphabetic sort. Plus sorting of textual representations of numbers is specifically outside the scope of the UCA.

    Of course it’s trivial to modify the UCA sort to take care of your weirdo situation, such that it does a proper text sort on the text and a weirdo binary sort on the binary. But you have to tell it to do that. It doesn’t play mind games with you; here as always, one has to know what one is doing, and why.
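
    For example, one way to "tell it to do that" could look like the sketch below (an illustration, not something from the article or the Unicode::Collate documentation; it assumes the binary part is a trailing run of control characters): build the sort key for the text part with getSortKey and break ties with a plain code-point comparison of the binary suffix.

    use strict;
    use warnings;
    use Unicode::Collate;

    my $uca = Unicode::Collate->new;

    # Hypothetical keys: readable text followed by an optional control-character
    # suffix that records insertion order.
    my @keys = ( "Bush\x01", "Adams", "Bush", "Adams\x00", "Bush\x00" );

    my @sorted =
        map  { $_->[0] }
        sort { $a->[1] cmp $b->[1]      # proper text sort on the text part
            || $a->[2] cmp $b->[2] }    # plain code-point sort on the binary suffix
        map  {
            my ($text, $suffix) = /\A(.*?)([\x00-\x1f]*)\z/s;
            [ $_, $uca->getSortKey($text), $suffix ];
        } @keys;

    for my $key (@sorted) {
        (my $shown = $key) =~ s/([\x00-\x1f])/sprintf '\\x%02x', ord $1/ge;
        print "$shown\n";
    }
    # Prints: Adams, Adams\x00, Bush, Bush\x00, Bush\x01 (one per line)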

      tchrist,

        See the red part?

      I re-checked and you are correct about the red part, and I was wrong for quoting you out of context. I apologize.

        Of course it’s trivial to modify the UCA sort to take care of your weirdo situation, such that it does a proper text sort on the text and a weirdo binary sort on the binary. But you have to tell it to do that. It doesn’t play mind games with you; here as always, one has to know what one is doing, and why.

      Do I understand you correctly that this can be done? I have read the docs on CPAN and the perldoc on my system, and I don't see how to do it. I know you think my request is a "...weirdo binary sort on the..." ASCII, but I could give many real-life instances where text and binary co-exist and require sorting. One example: a desktop calendar program whose events are all stored in a database server. The key part of the key/value pair would contain binary ASCII data (time, duration, etc.) as well as the title of the event and possibly sequencing information (base 32). The value would be a description of the event; no sorting is required for that, and it could be UTF-nn or ASCII. The database engine doesn't care about the data portion, only the key matters.

      It would be wonderful if the database engine could sort the key information so that the language of the title is handled correctly and the ASCII portion is handled correctly as well.
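
      One possible shape for that, sketched with an invented key layout (a fixed-width packed start time and duration up front, the event title at the end; none of the names or field widths come from a real product): compare the binary prefix byte-wise first, then collate the title with Unicode::Collate.

      use strict;
      use warnings;
      use Unicode::Collate;

      my $uca = Unicode::Collate->new;

      # Invented layout: 4-byte big-endian start time, 2-byte duration, then title.
      my @keys = (
          pack("N n", 1_326_800_000, 30) . "Zahnarzt",
          pack("N n", 1_326_800_000, 30) . "\xC9quipe review",   # leading E acute
          pack("N n", 1_326_750_000, 60) . "Lunch",
      );

      my @sorted =
          map  { $_->[0] }
          sort { $a->[1] cmp $b->[1]     # binary prefix: plain byte order
              || $a->[2] cmp $b->[2] }   # title: dictionary (UCA) order
          map  {
              [ $_, substr($_, 0, 6), $uca->getSortKey(substr $_, 6) ]
          } @keys;

      for my $key (@sorted) {
          my ($time, $duration) = unpack "N n", $key;
          printf "%u  %3u  %s\n", $time, $duration, substr($key, 6);
      }
      # "Lunch" comes first (earlier start time); the two events in the later
      # slot are ordered by title, with "Equipe review" collating before
      # "Zahnarzt" even though its first byte (0xC9) is numerically larger.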

      Thank you

      "Well done is better than well said." - Benjamin Franklin

Re^3: RFC: Is this the correct use of Unicode::Collate?
by Jim (Curate) on Jun 24, 2012 at 02:13 UTC
    A "common" practice for handling duplicate names in a database is to append non-printable characters after the name, in the order of insertion.

    What you need is an invisible letter in Unicode. Just such a letter was proposed several years ago by typographer Michael Everson. His proposed name for the character was INVISIBLE LETTER. Unfortunately, the Unicode Consortium rejected his proposal. See Proposal to add INVISIBLE LETTER to the UCS and Every character has a story #11: U+???? (The Invisible Letter)

    If there were such an invisible Unicode character, you could do something like this:

    #!perl

    use strict;
    use warnings;
    use open qw( :std :encoding(UTF-8) );
    use charnames qw( :full );
    use Unicode::Collate;

    my $DISAMBIGUATOR_CHARACTER = "\N{LATIN SMALL LIGATURE FFL}";    # U+FB04

    my %president_number_by;    # President number by president name
    my %seen;

    while (<DATA>) {
        chomp;
        my ($name, $number) = split m/,/, $_, 2;
        $seen{$name} = exists $seen{$name}
            ? $seen{$name} . $DISAMBIGUATOR_CHARACTER
            : $name;
        $president_number_by{$seen{$name}} = $number;
    }

    my $collator = Unicode::Collate->new();

    for my $name ($collator->sort(keys %president_number_by)) {
        my $number = $president_number_by{$name};
        $name =~ s/$DISAMBIGUATOR_CHARACTER+$//;
        print "$name,$number\n";
    }

    exit 0;

    __DATA__
    Washington,1
    Adams,2
    Jefferson,3
    Madison,4
    Monroe,5
    Adams,6
    Jackson,7
    Van Buren,8
    Harrison,9
    Tyler,10
    Polk,11
    Taylor,12
    Fillmore,13
    Pierce,14
    Buchanan,15
    Lincoln,16
    Johnson,17
    Simpson,18
    Hayes,19
    Garfield,20
    Arthur,21
    Cleveland,22
    Harrison,23
    Cleveland,24
    McKinley,25
    Roosevelt,26
    Taft,27
    Wilson,28
    Harding,29
    Coolidge,30
    Hoover,31
    Roosevelt,32
    Truman,33
    Eisenhower,34
    Kennedy,35
    Johnson,36
    Nixon,37
    Ford,38
    Carter,39
    Reagan,40
    Bush,41
    Clinton,42
    Bush,43
    Obama,44
    Bush,45

    This script produces this output:

    Adams,2
    Adams,6
    Arthur,21
    Buchanan,15
    Bush,41
    Bush,43
    Bush,45
    Carter,39
    Cleveland,22
    Cleveland,24
    Clinton,42
    Coolidge,30
    Eisenhower,34
    Fillmore,13
    Ford,38
    Garfield,20
    Harding,29
    Harrison,9
    Harrison,23
    Hayes,19
    Hoover,31
    Jackson,7
    Jefferson,3
    Johnson,17
    Johnson,36
    Kennedy,35
    Lincoln,16
    Madison,4
    McKinley,25
    Monroe,5
    Nixon,37
    Obama,44
    Pierce,14
    Polk,11
    Reagan,40
    Roosevelt,26
    Roosevelt,32
    Simpson,18
    Taft,27
    Taylor,12
    Truman,33
    Tyler,10
    Van Buren,8
    Washington,1
    Wilson,28

    (For the purpose of demonstrating more than two presidents with the same last name, I had to assume Barack Obama is re-elected in 2012 and Jeb Bush is elected in 2016. I'm sorry if this prospect offends you.)

    This is a pure Unicode solution to the problem. There's no commingling of Unicode characters or graphemes with binary data. Unfortunately, however, there isn't a Unicode character with the general property L (Letter) that's guaranteed to be invisible. If there were, it would be just the right character to use for this "weirdo" purpose.

    Why did I use the Unicode character LATIN SMALL LIGATURE FFL in the demo script? I don't know exactly. Maybe because it's a character that collates high and seems impossibly unlikely ever to occur in real data.
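
    One quick way to see why the obvious zero-width characters don't fit the bill (my own check, not part of the original post): they are format characters rather than letters, and the default collation table generally ignores them completely, which would defeat the disambiguation.

    use strict;
    use warnings;
    use charnames qw( :full );
    use Unicode::Collate;

    my $uca = Unicode::Collate->new;

    my %candidate = (
        'ZERO WIDTH SPACE'         => "\N{ZERO WIDTH SPACE}",           # U+200B
        'WORD JOINER'              => "\N{WORD JOINER}",                # U+2060
        'LATIN SMALL LIGATURE FFL' => "\N{LATIN SMALL LIGATURE FFL}",   # U+FB04
    );

    for my $name (sort keys %candidate) {
        my $ch = $candidate{$name};
        printf "%-26s  letter: %-3s  ignored by the collator: %s\n",
            $name,
            $ch =~ /\A\p{L}\z/                             ? 'yes' : 'no',
            $uca->getSortKey($ch) eq $uca->getSortKey('')  ? 'yes' : 'no';
    }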

    Jim

      Jim,

      Thank you for your input. You seem to know quite a bit about Unicode.

      What I tried to ask in the original post was why 'use Unicode::Collate;' changes the meaning of characters 0..31. Everything I have read talks about not changing the meaning of 7-bit ASCII.

      History of the question:

      I don't know if you are familiar with the NoSQL database engine BerkeleyDB (now owned by Oracle), but I have written a pure Perl replacement that performs as well. In some cases, where the data portion of the key/value pair is very large, it outperforms BerkeleyDB.

      Most people on this forum believe that BerkeleyDB is free. According to our law firm's counsel, Oracle has added conditions that make it very expensive. One example: if a company employee downloads BerkeleyDB and installs it, that's okay. But as a software vendor, if I download it and install it, the company owes Oracle a fee based on the number of cores and the type of box. For a POWER7 IBM p-series with 32 cores, the license fee is $48,000 for the "free" BerkeleyDB.

      Most of our products sell for under $5,000; it's hard to ask a company to pay an additional $48K.

      Since PurePerlDB already exists, I was looking at adding a feature that uses Unicode::Collate, but it broke other features of PurePerlDB. Unfortunately, my only solution for now is to put the burden of handling Unicode and duplicates on the software developer, which is the same situation as with BerkeleyDB.

      Thanks again for your input...Ed

      "Well done is better than well said." - Benjamin Franklin

        Most people on this forum believe that BerkeleyDB is free. According to our law firm's counsel, Oracle has added conditions that make it very expensive. One example: if a company employee downloads BerkeleyDB and installs it, that's okay. But as a software vendor, if I download it and install it, the company owes Oracle a fee based on the number of cores and the type of box. For a POWER7 IBM p-series with 32 cores, the license fee is $48,000 for the "free" BerkeleyDB.

        Just in case anyone was wondering about it, see my take on it in Open Source License for Berkeley DB unchanged

        The situation hasn't changed with the latest Berkeley DB 5.3.21; the license is essentially the same, though there is an addition of ASM for Java (it only affects the Java bits, not distribution or pricing).

        But I'm not a businessman or a lawyer, nor do I work for Oracle.


        Regarding http://www.flexbasedb.com/, I notice you provide only PDF, not HTML; a minor hassle.

        For anyone interested in PurePerlDB/FlexBaseDB, here is the example from http://www.flexbasedb.com/FlexBaseDB_Introduction.pdf:

        use strict;
        use warnings;
        use FlexBaseDB;

        my $dirname = '/home/FlexBaseDB';
        unlink glob("$dirname/*");

        my $fbenv = FB_OpenENV( EnvHome => $dirname );    ## Directory for database(s)
        if ( ! $fbenv ) {
            die "FB_OpenENV: Bad ENV\n";
        }

        my $filename = "TestDB";    ## Test file name in Environment!
        my $fb = FB_OpenDB(
            FB_Name => $filename,   ## Unique name of database
            FB_ENV  => $fbenv,      ## reference from FB_OpenENV
        );
        if ( ! $fb ) {
            die "FB_OpenDB: Bad FILE\n";
        }

        my $key  = "Hello";
        my $data = "World, we're here!";
        my $ret;

        for my $count ( 1..5 ) {
            $ret = FB_Write( $fb, \"$key-$count", \$data );
            if ( $ret == FALSE ) {
                die "Write failed $FB_Error \n";
            }
        }

        if ( FB_Seek( $fb, \$key, FB_FIRST ) ) {
            print "\nOutput:\n\n";
            while ( $ret ) {
                $key  = "";
                $data = "";
                $ret  = FB_ReadNext( $fb, \$key, \$data );
                print "$key\t$data\n";
            }
        }

        print "\n", "=" x 54, "\n";

        ## Will print statistics for your DB
        my @results = FB_Stat( $fb );
        for my $no ( 0 .. $#results ) {
            if ( substr( $results[$no], 0, 1 ) eq "=" ) {
                $results[$no] = "=" x 54;
            }
            print "$results[$no]\n";
        }
        print "=" x 54, "\n";

        $ret = FB_CloseDB( $fb );
        $ret = FB_CloseENV( $fbenv );

        __END__
        Output:

        Hello-1    World, we're here!
        Hello-2    World, we're here!
        Hello-3    World, we're here!
        Hello-4    World, we're here!
        Hello-5    World, we're here!

        I don't know if you are familiar with the NoSQL database engine BerkeleyDB (now owned by Oracle), but I have written a pure Perl replacement that performs as well. In some cases, where the data portion of the key/value pair is very large, it outperforms BerkeleyDB.

        I'm familiar with NoSQL and key-value stores such as Berkeley DB. But what I'd never heard of before reading your PerlMonks post is the idiom—the trick—of modifying data to disambiguate otherwise identical keys by appending control codes or invisible characters to them. This idiom seems "weirdo" to me, just as it did to Tom, who first invoked the word to describe it.

        Is my example Perl script a fair representation of the idiom your NoSQL database software uses to disambiguate like keys?

        I'm not a database theory guru or a database programming wizard, but my gut sense is that the idiom you describe of ornamenting data with invisible control codes or other characters is fraught with problems. I understand how data modified this way would ensure uniqueness and preserve insertion order. But how then do you match such modified strings? Isn't there a better way to achieve the same objectives without altering data? Do other NoSQL database engines besides yours use this same idiom? If so, which ones?

        Jim