It would be interesting to see how "monstrous" your temporary sort function is -- it might not be as bad as you think, or might at least be on the right track. If you use "tr///" to normalize the case and accents, I don't see why that would be bad in terms of scalability or generality (assuming that your "general" case always involves latin1 characters). Using "s///" for the same one-to-one replacements is likely to be slower, so you probably want to keep it for the cases that "tr///" can't handle.
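If you want numbers rather than my guess about "tr///" versus "s///", the core Benchmark module can compare the two. This is only a rough sketch: the sample words, the reduced character set, and the names fold_tr and fold_s are made up for illustration, and it assumes the script is saved as UTF-8 (hence "use utf8").

```perl
use strict;
use warnings;
use utf8;                        # assumption: this file is saved as UTF-8
use Benchmark qw(cmpthese);

# Illustrative sample data, repeated so the timer has something to measure.
my @words = ('Ärger', 'élan', 'Über', 'naïve', 'Öl') x 200;

# Fold with tr///: plain character-for-character mapping.
sub fold_tr {
    my ($s) = @_;
    $s =~ tr/A-Z/a-z/;
    $s =~ tr/ÄäÉéÏïÖöÜü/aaeeiioouu/;
    return $s;
}

# The same mapping done with s///g, for comparison.
sub fold_s {
    my ($s) = @_;
    $s =~ tr/A-Z/a-z/;
    $s =~ s/[Ää]/a/g;
    $s =~ s/[Éé]/e/g;
    $s =~ s/[Ïï]/i/g;
    $s =~ s/[Öö]/o/g;
    $s =~ s/[Üü]/u/g;
    return $s;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    'tr///' => sub { my @f = map { fold_tr($_) } @words },
    's///g' => sub { my @f = map { fold_s($_) } @words },
});
```

The table only tells you about this toy data on your machine, so treat it as a hint and rerun it with a sample of your real strings.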

Have you profiled the code and determined that your sort routine is really taking too much time? If you haven't measured the performance and confirmed that the sort is the bottleneck, it might not be worthwhile to try to optimize it (the real bottleneck might be somewhere else, like file I/O).

You say the set of strings to sort will number in the few hundreds of thousands -- but how big are the strings? Are you sure you need to tie them into a DB_File, as opposed to simply having them in a memory-resident hash?

Does your sort function look like the following? This is how I would do it, just off the top of my head -- I don't know how it would perform on large quantities of data. It might be more "optimal" to fold everything into a single big "tr///", but it just seemed quicker/easier to code it this way (mostly in terms of avoiding slips with too many or too few replacement characters on the RHS):

sub fold_sort {
    my ( $x, $y ) = ( $a, $b );
    for ( $x, $y ) {
        tr/A-Z/a-z/;
        tr/ÀÁÂÃÄÅàáâãäå/a/;
        tr/ÈÉÊËèéêë/e/;
        tr/ÌÍÎÏìíîï/i/;
        tr/ÒÓÔÕÖØòóôõöø/o/;
        tr/ÙÚÛÜùúûü/u/;
        tr/ÇçÑñÝýÿ/ccnnyyy/;
        s/ß/ss/g;
    }
    $x cmp $y;
}
(Update: added the s///g for the ß "double-S" character.)

(My 5.8.1 under SuSE linux doesn't have any trouble with treating/keeping the latin1 data as-is, but folks with Red Hat and 5.8.0 might have to add a "use bytes" pragma to make it work.)
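One thing that may matter with a few hundred thousand strings: a comparison sub like the one above refolds both strings on every comparison. A common alternative (not what I posted above, just a sketch) is the Schwartzian transform, which folds each string once, sorts on the folded key, and then throws the key away. The names fold_key and fold_sorted are made up here, and the sketch assumes the script is saved as UTF-8 with "use utf8" rather than handling raw latin1 bytes.

```perl
use strict;
use warnings;
use utf8;                        # assumption: this file is saved as UTF-8

# Hypothetical helper: returns a lowercase, accent-stripped copy of a string,
# using the same mappings as fold_sort.
sub fold_key {
    my ($s) = @_;
    for ($s) {
        tr/A-Z/a-z/;
        tr/ÀÁÂÃÄÅàáâãäå/a/;
        tr/ÈÉÊËèéêë/e/;
        tr/ÌÍÎÏìíîï/i/;
        tr/ÒÓÔÕÖØòóôõöø/o/;
        tr/ÙÚÛÜùúûü/u/;
        tr/ÇçÑñÝýÿ/ccnnyyy/;
        s/ß/ss/g;
    }
    return $s;
}

# Schwartzian transform: pair each string with its folded key (computed once),
# sort the pairs by key, then keep only the original strings.
sub fold_sorted {
    return map  { $_->[1] }
           sort { $a->[0] cmp $b->[0] }
           map  { [ fold_key($_), $_ ] } @_;
}

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for fold_sorted(qw(Zebra élan Apple ärger Elbe));
```

For that sample list the order comes out as Apple, ärger, élan, Elbe, Zebra. The trade-off is memory: every string gets a folded copy held in RAM for the duration of the sort, which is worth weighing against the size of your data.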


In reply to Re: Diacritic-Insensitive and Case-Insensitve Sorting by graff
in thread Diacritic-Insensitive and Case-Insensitve Sorting by Anonymous Monk
