Re: Diacritic-Insensitive and Case-Insensitive Sorting

That's an interesting question. I was hesitant to post my solution because I'm thinking someone will come along with a POSIX module solution that normalizes diacritic symbols to their base character automatically. But in case that response doesn't come along, I'll give it a go...

Use tr/// to transliterate characters with diacritic symbols into their base character. Put the transliteration into a function to keep your code clean. Then use a Schwartzian Transform in your sort so that each string is converted only once (which is faster than converting inside every comparison) and so that your original text is left intact.


sub sterilize {
    my $char = shift;
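    # tr/// repeats the last replacement character, so every listed
    # character (including plain uppercase E) becomes "e".
    $char =~ tr/EéÉêÊè/e/;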
    # All of your other transliterations go here too.
    return $char;
}


my @array = (
    # Your stuff to be sorted goes here
);

my @sorted;

@sorted = map { $_->[1] } 
        sort { ( $a->[0] cmp $b->[0] ) or ( $a->[1] cmp $b->[1] ) }
        map { [ sterilize($_), $_ ] } @array;

As you can see, there are still a few minor blanks you have to fill in for this to be fully functional. You'll have to work out how to apply this to your database-tied hash, and you'll need to add the transliterations for the other diacritic symbols that I didn't enumerate (I don't know how to type them). But the idea should be clear enough.
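For what it's worth, here is one way the missing transliterations might be filled in for Latin-1 data. This is only a sketch: the name fold_latin1 is made up for illustration, the characters are written as hex escapes so the script does not depend on the encoding of the source file, and the choice of which letters fold to which base letter (for example, O-slash to "o") is my assumption, not anything from the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: fold Latin-1 accented letters to lowercase ASCII.
# Each tr/// has a one-character replacement list, so Perl repeats that
# character for every entry in the search list.
sub fold_latin1 {
    my ($str) = @_;
    $str = lc $str;                               # plain A-Z to a-z
    $str =~ tr/\xC0-\xC5\xE0-\xE5/a/;             # A/a with grave .. ring
    $str =~ tr/\xC7\xE7/c/;                       # C/c with cedilla
    $str =~ tr/\xC8-\xCB\xE8-\xEB/e/;             # E/e with accents
    $str =~ tr/\xCC-\xCF\xEC-\xEF/i/;             # I/i with accents
    $str =~ tr/\xD1\xF1/n/;                       # N/n with tilde
    $str =~ tr/\xD2-\xD6\xD8\xF2-\xF6\xF8/o/;     # O/o with accents, O-slash
    $str =~ tr/\xD9-\xDC\xF9-\xFC/u/;             # U/u with accents
    $str =~ tr/\xDD\xFD\xFF/y/;                   # Y/y with acute, y-diaeresis
    return $str;
}

# "\xC9t\xE9" is "Ete" with an acute accent on both e's.
print fold_latin1("\xC9t\xE9"), "\n";
```

Characters such as the sharp s (\xDF) or the AE ligature (\xC6, \xE6) expand to two letters, so they need s/// rather than tr/// and are left out here.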

Good luck. ...now I'll sit back and watch for a more elegant solution.

Update: Added short-circuit to the sort routine to force a defined order in cases where sorted strings are rendered equal by stripping diacritics.
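To make the effect of that tie-break concrete, here is a small self-contained run. The sample words are made up, and the sterilize() below is a cut-down variant (lowercase plus the Latin-1 accented e's only) just so the snippet stands alone; treat it as a sketch rather than the finished routine.

```perl
use strict;
use warnings;

# Cut-down sterilize(): lowercases ASCII and folds only the Latin-1
# accented E/e characters, which is enough for this demonstration.
sub sterilize {
    my ($str) = @_;
    $str = lc $str;
    $str =~ tr/\xC8-\xCB\xE8-\xEB/e/;
    return $str;
}

# Made-up sample data: four spellings that all sterilize to "ecole".
my @array = ( "fig", "\xE9cole", "Ecole", "dog", "\xC9cole", "ecole" );

# Sort on the sterilized key first; when two keys are equal, fall back to
# comparing the original strings so the result is always the same.
my @sorted = map  { $_->[1] }
             sort { $a->[0] cmp $b->[0] or $a->[1] cmp $b->[1] }
             map  { [ sterilize($_), $_ ] } @array;

# Resulting order: dog, Ecole, ecole, \xC9cole, \xE9cole, fig
```

Without the second comparison, the four "ecole" variants compare as equal, so their relative order depends only on the input order and on how sort treats equal elements.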

Dave

Replies are listed 'Best First'.

Re: Re: Diacritic-Insensitive and Case-Insensitive Sorting
by graff (Chancellor) on Jan 05, 2004 at 06:05 UTC

"I was hesitant to post my solution because I'm thinking someone will come along with a POSIX module solution that normalizes diacritic symbols to their base character automatically."

The same thought crossed my mind, in terms of using Unicode character classes. I checked the perlfaqs in 5.8.1 and the perlunicode man page and didn't find anything relevant, though I think there was a discussion about this sort of thing (removing accents) on the perl-unicode mailing list within the last couple of weeks.

In any case, I would hesitate to look for that sort of solution -- there's a reasonably good chance that operations using Unicode character classes will end up being slower than just doing plain old "tr///" on plain old Latin-1 bytes. That, combined with the fact that AM (the Anonymous Monk who asked) would need to convert everything to UTF-8 first, makes this somewhat unlikely to succeed as an "optimization".
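For comparison, the Unicode-class approach would probably look something like the sketch below: decompose each string with NFD from the Unicode::Normalize module, then delete the combining marks with \p{Mn}. The name strip_marks is invented for illustration, and the sketch assumes the strings already hold decoded characters rather than raw bytes. I have not benchmarked it against tr///.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: assumes $str contains decoded characters.
sub strip_marks {
    my ($str) = @_;
    $str = NFD($str);         # e.g. U+00E9 becomes "e" followed by U+0301
    $str =~ s/\p{Mn}+//g;     # remove the combining (nonspacing) marks
    return lc $str;           # fold case for case-insensitive comparison
}

# "\x{C9}t\x{E9}" is "Ete" with an acute accent on both e's.
print strip_marks("\x{C9}t\x{E9}"), "\n";
```

Note that letters with no canonical decomposition, such as O-slash (U+00D8), pass through NFD unchanged, so they would still need explicit handling.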

As for a POSIX (as opposed to Unicode) module, I would guess that if someone decided to do this in "pure perl", it would end up as just a "tr///" statement...
