in reply to Re^2: RFC: Is this the correct use of Unicode::Collate?
in thread RFC: Is this the correct use of Unicode::Collate?

A "common" practice for handling duplicate names in a database is to append non-printable characters after the name, in the order of insertion. This is like using base 32 (numbers 0 to 31) for the appended characters. This allows duplicates and retains the order of insertion. There is no limit: when you fill the first character, you just add another as "\0" and continue from there. That would be broken with Unicode::Collate.
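A minimal sketch of the suffix scheme described above, assuming the last control character is incremented until it reaches 31 and a new "\0" is appended after that. The helper name `next_suffix` and the sample name are hypothetical. With Perl's built-in `sort`, which compares code points, the suffixed names come back in insertion order:

```perl
use strict;
use warnings;

# Hypothetical helper: return the suffix that follows $suffix,
# using only control characters 0..31.
sub next_suffix {
    my ($suffix) = @_;
    return "\x00" if $suffix eq '';
    my $last = ord substr($suffix, -1);
    # Last slot is full: keep it and start a new one at "\0".
    return $suffix . "\x00" if $last == 0x1F;
    # Otherwise bump the last character by one.
    return substr($suffix, 0, -1) . chr($last + 1);
}

# Build 40 duplicates of the same name, in insertion order.
my $suffix = '';
my @inserted;
for (1 .. 40) {
    $suffix = next_suffix($suffix);
    push @inserted, "Smith$suffix";
}

# The default sort compares code points, so insertion order survives.
my @sorted = sort @inserted;
print "order preserved\n" if "@sorted" eq "@inserted";
```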

The implication in the article was that you could replace 'sort' with 'Unicode::Collate'.

I’m afraid you’ve swapped my implication with your inference, as I implied no such thing — and what you’ve inferred in no way follows from what I wrote. Quoting myself, I wrote:
    If you have code that purports to sort text that looks like this:
        @sorted_lines = sort @lines;
    Then all you have to do to get a dictionary sort is write instead:
        use Unicode::Collate;
        @sorted_lines = Unicode::Collate::->new->sort(@lines);
See the red part? Clearly, you do not have ‘code that purports to sort text’! Therefore, nothing I wrote applies to you.

You have code that blindly does a mindless numeric sort on code points, not an alphabetic sort on text. What you are doing is not an alphabetic sort. Plus sorting of textual representations of numbers is specifically outside the scope of the UCA.

Of course it’s trivial to modify the UCA sort to take care of your weirdo situation, such that it does a proper text sort on the text and a weirdo binary sort on the binary. But you have to tell it to do that. It doesn’t play mind games with you; here as always, one has to know what one is doing, and why.
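One way this might look, as a sketch rather than a definitive recipe: split each key into its text and its trailing control characters, order the text by Unicode::Collate sort keys, and break ties on the binary part with the ordinary `cmp`. The key layout (text followed by a run of characters in the range 0 to 31) and the names `split_key` and `hybrid_sort` are assumptions made for illustration.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $collator = Unicode::Collate->new;

# Hypothetical helper: separate the text from the trailing run of
# control characters (0..31). This key layout is an assumption.
sub split_key {
    my ($key) = @_;
    my ($text, $binary) = $key =~ /\A(.*?)([\x00-\x1F]*)\z/s;
    return ($text, $binary);
}

# Order keys by UCA sort key on the text part, then by code point
# on the binary part.
sub hybrid_sort {
    return map  { $_->[0] }
           sort { $a->[1] cmp $b->[1] or $a->[2] cmp $b->[2] }
           map  {
               my ($text, $binary) = split_key($_);
               [ $_, $collator->getSortKey($text), $binary ];
           } @_;
}

my @keys   = ("Zebra\x01", "apple\x02", "apple\x01", "banana", "Zebra\x00");
my @sorted = hybrid_sort(@keys);
# @sorted: apple\x01, apple\x02, banana, Zebra\x00, Zebra\x01
```

A plain `sort` on the same list would put both "Zebra" keys first, because uppercase letters have lower code points than lowercase ones.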

Re^4: RFC: Is this the correct use of Unicode::Collate?
by flexvault (Monsignor) on Jan 17, 2012 at 19:05 UTC

    tchrist,

      See the red part?

    I re-checked and you are correct about the red part, and I was wrong for quoting you out of context. I apologize.

      Of course it’s trivial to modify the UCA sort to take care of your weirdo situation, such that it does a proper text sort on the text and a weirdo binary sort on the binary. But you have to tell it to do that. It doesn’t play mind games with you; here as always, one has to know what one is doing, and why.

    Do I understand you correctly that it can be done? I have read the docs on CPAN and the perldoc on my system, and I don't see how to do this. I know you think my request is a "...weirdo binary sort on the..." ASCII, but I could give many instances of real-life uses where text and binary data co-exist and both require sorting. One example: a desktop calendar program where all events are in a database server. The key part of the key/value pair would contain binary ASCII data (time, duration, etc.) as well as the title of the event and possibly sequencing information (base 32). The data value would be a description of the event. No sorting is required for that, and it could be UTF-nn or ASCII. The database engine doesn't care about the data portion; only the key matters.

    It would be wonderful if the database engine could sort the key information so the language of the title was handled correctly and the ASCII portion is also handled correctly.

    Thank you

    "Well done is better than well said." - Benjamin Franklin