Re: Diacritic-Insensitive and Case-Insensitive Sorting

It would be interesting to see how "monstrous" your temporary sort function is -- it might not be as bad as you think, or might at least be on the right track. If you use "tr///" to normalize the case and accents, it's not clear to me that this would be bad in terms of scalability or generality (assuming that your "general" case always involves latin1 characters). Using "s///" for the one-to-one replacements is likely to be slower, so you'd want to stick with "tr///" wherever it can do the job.
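If you want to check the tr/// versus s/// claim on your own machine rather than take my word for it, something like the following sketch would do. The sub names and the sample string are made up for illustration, and the characters are written as \x escapes so the result doesn't depend on how the script file is saved. I'm not quoting any timings here, since they will vary with the Perl build and the data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Fold the accented "a" family (latin1 C0-C5 and E0-E5) with tr///.
sub fold_tr {
    my ($s) = @_;
    $s =~ tr/\xC0-\xC5\xE0-\xE5/a/;
    return $s;
}

# The same folding done with s///g and a character class.
sub fold_subst {
    my ($s) = @_;
    $s =~ s/[\xC0-\xC5\xE0-\xE5]/a/g;
    return $s;
}

# Made-up sample data: a short latin1 string repeated a few times.
my $sample = "\xC0 la carte, \xE5ngstr\xF6m, p\xE2t\xE9 " x 10;

# Only run the timing when asked to, e.g. "perl script.pl bench".
if ( @ARGV && $ARGV[0] eq 'bench' ) {
    cmpthese( -2, {
        'tr///' => sub { fold_tr($sample) },
        's///g' => sub { fold_subst($sample) },
    } );
}
```

Both subs should return identical strings; only the speed differs, and cmpthese prints a table comparing the two.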

Have you profiled the code and confirmed that your sort routine is really taking too much time? If you haven't measured it, optimizing the sort might not be worthwhile (i.e. the real bottleneck might be somewhere else, like file I/O).
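A crude way to see where the time goes, short of a real profiler, is to time each phase separately. This is only a sketch: "timed" and "make_strings" are hypothetical helpers, the random strings stand in for whatever your program actually loads, and the plain lc comparison stands in for your real sort routine.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical helper: run a code ref in list context and report
# its wall-clock time on STDERR.
sub timed {
    my ( $label, $code ) = @_;
    my $start  = time();
    my @result = $code->();
    printf STDERR "%-6s %.3f s\n", $label, time() - $start;
    return @result;
}

# Stand-in for the real input step: $n random 8-letter strings.
sub make_strings {
    my ($n) = @_;
    my @letters = ( 'a' .. 'z', 'A' .. 'Z' );
    my @out;
    for ( 1 .. $n ) {
        push @out, join '', map { $letters[ rand @letters ] } 1 .. 8;
    }
    return @out;
}

my @strings = timed( 'load', sub { make_strings(20_000) } );
my @sorted  = timed( 'sort', sub { sort { lc($a) cmp lc($b) } @strings } );
```

If the "sort" line turns out to be a small fraction of the total, the sort routine is probably not the thing to optimize.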

You say the set of strings to sort will number in the few hundreds of thousands -- but how big are the strings? Are you sure you need to tie them into a DB_File, as opposed to simply having them in a memory-resident hash?

Does your sort function look like the following? This is how I would do it, just off the top of my head -- I don't know how it would perform on large quantities of data. It might be more "optimal" to fold everything into a single big "tr///", but it just seemed quicker/easier to code it this way (mostly in terms of avoiding slips with too many or too few replacement characters on the RHS):

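sub fold_sort {
    # Work on copies so the strings in $a and $b are left untouched.
    my ( $x, $y ) = ( $a, $b );
    for ( $x, $y ) {
        tr/A-Z/a-z/;
        # When the replacement list is shorter than the search list, tr///
        # repeats its last character, so each line folds a whole family.
        tr/ÀÁÂÃÄÅàáâãäå/a/;
        tr/ÈÉÊËèéêë/e/;
        tr/ÌÍÎÏìíîï/i/;
        tr/ÒÓÔÕÖØòóôõöø/o/;
        tr/ÙÚÛÜùúûü/u/;
        tr/ÇçÑñÝýÿ/ccnnyyy/;
        s/ß/ss/g;    # one character becomes two, so tr/// can't do this one
    }
    $x cmp $y;
}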

(Update: added the s///g for the double-S symbol, "ß".)
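One thing to keep in mind with fold_sort above is that it refolds both strings on every comparison. If that turns out to matter for a few hundred thousand strings, a variation is to fold each string once and sort on the cached keys (the usual map/sort/map idiom). This is a different technique from the sub above, not a drop-in replacement, and I haven't timed it on your data. "fold_key" and "sort_folded" are names I made up, and I've used \x escapes for the latin1 characters so the code doesn't depend on the encoding of the script file.

```perl
use strict;
use warnings;

# Hypothetical helper: return a case- and accent-folded copy of a latin1
# string, using the same mappings as fold_sort.
sub fold_key {
    my ($s) = @_;
    for ($s) {
        tr/A-Z/a-z/;
        tr/\xC0-\xC5\xE0-\xE5/a/;                    # A and a with accents
        tr/\xC8-\xCB\xE8-\xEB/e/;                    # E and e with accents
        tr/\xCC-\xCF\xEC-\xEF/i/;                    # I and i with accents
        tr/\xD2-\xD6\xD8\xF2-\xF6\xF8/o/;            # O and o, including O-slash
        tr/\xD9-\xDC\xF9-\xFC/u/;                    # U and u with accents
        tr/\xC7\xE7\xD1\xF1\xDD\xFD\xFF/ccnnyyy/;    # C-cedilla, N-tilde, Y forms
        s/\xDF/ss/g;                                 # sharp s
    }
    return $s;
}

# Fold each string once, sort on the cached keys, then drop the keys.
sub sort_folded {
    return map  { $_->[1] }
           sort { $a->[0] cmp $b->[0] }
           map  { [ fold_key($_), $_ ] } @_;
}

my @words  = ( "Zebra", "\xE9clair", "apple", "\xC5ngstr\xF6m", "Stra\xDFe" );
my @sorted = sort_folded(@words);
```

The trade-off is memory: every string gets a second, folded copy for the duration of the sort.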

(My 5.8.1 under SuSE linux doesn't have any trouble with treating/keeping the latin1 data as-is, but folks with Red Hat and 5.8.0 might have to add a "use bytes" pragma to make it work.)
