in reply to Diacritic-Insensitive and Case-Insensitive Sorting
Have you profiled the code and determined that your sort routine is really taking too much time? If you haven't measured the performance and confirmed that the sort is the bottleneck, it might not be worthwhile to try to optimize it (i.e. the real bottleneck might be somewhere else, like file I/O).
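For a quick first measurement before reaching for a full profiler, something like the following might do. It is only a minimal sketch using the core Time::HiRes module; time_it is a name I made up for illustration, and the commented usage line assumes your own sort sub and data.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical helper: run a code ref once and report wall-clock seconds.
sub time_it {
    my ( $label, $code ) = @_;
    my $t0 = [gettimeofday];            # start timestamp
    $code->();                          # the step being measured
    my $elapsed = tv_interval($t0);     # seconds elapsed since $t0
    printf STDERR "%s: %.3f s\n", $label, $elapsed;
    return $elapsed;
}

# Usage (assumes your own @strings and sort sub):
# time_it( 'sort', sub { my @sorted = sort fold_sort @strings } );
```

Timing the file reading and the sort separately this way should show fairly quickly which one dominates.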
You say the set of strings to sort will number in the few hundreds of thousands -- but how big are the strings? Are you sure you need to tie them into a DB_File, as opposed to simply having them in a memory-resident hash?
Does your sort function look like the following? This is how I would do it, just off the top of my head -- I don't know how it would perform on large quantities of data. It might be more "optimal" to fold everything into a single big "tr///", but it just seemed quicker/easier to code it this way (mostly in terms of avoiding slips with too many or too few replacement characters on the RHS):
(Update: added the s///g; for the double-S symbol.)

    sub fold_sort {
        my ( $x, $y ) = ( $a, $b );
        for ( $x, $y ) {
            tr/A-Z/a-z/;
            tr/ÀÁÂÃÄÅàáâãäå/a/;
            tr/ÈÉÊËèéêë/e/;
            tr/ÌÍÎÏìíîï/i/;
            tr/ÒÓÔÕÖØòóôõöø/o/;
            tr/ÙÚÛÜùúûü/u/;
            tr/ÇçÑñÝýÿ/ccnnyyy/;
            s/ß/ss/g;
        }
        $x cmp $y;
    }
(My Perl 5.8.1 under SuSE Linux doesn't have any trouble with treating/keeping the Latin-1 data as-is, but folks with Red Hat and 5.8.0 might have to add a "use bytes" pragma to make it work.)
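Since a comparison sub like fold_sort redoes the folding on both strings every time sort compares them, one thing that might help with a few hundred thousand strings is to compute each folded key once and sort on the cached keys (a Schwartzian transform). This is a rough sketch that I have not tried on large data; fold_key and fold_sorted are names invented here, and the "use utf8" line assumes the script is saved as UTF-8 and the strings are decoded character strings, not raw Latin-1 bytes as discussed above.

```perl
use strict;
use warnings;
use utf8;    # assumption: this source file is saved as UTF-8

# Hypothetical helper: return the folded (lowercase, unaccented) key for one string.
sub fold_key {
    my ($s) = @_;
    for ($s) {    # alias $_ to $s so tr/// and s/// act on it
        tr/A-Z/a-z/;
        tr/ÀÁÂÃÄÅàáâãäå/a/;
        tr/ÈÉÊËèéêë/e/;
        tr/ÌÍÎÏìíîï/i/;
        tr/ÒÓÔÕÖØòóôõöø/o/;
        tr/ÙÚÛÜùúûü/u/;
        tr/ÇçÑñÝýÿ/ccnnyyy/;
        s/ß/ss/g;
    }
    return $s;
}

# Hypothetical helper: fold each string once, sort on the keys,
# then hand back the original strings in that order.
sub fold_sorted {
    return map  { $_->[1] }
           sort { $a->[0] cmp $b->[0] }
           map  { [ fold_key($_), $_ ] } @_;
}

# Usage: my @sorted = fold_sorted(@strings);
```

The trade-off is memory: every string gets a second, folded copy for the duration of the sort, which may matter if the strings are long.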