You can speed this up considerably by transliterating (tr///) every character that maps to a single letter, and then substituting (s///) only the few characters that need a multi-letter replacement.

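    use strict;
    use warnings;
    use utf8;    # the literals below are non-ASCII; assumes a UTF-8 source file
    binmode STDOUT, ':encoding(UTF-8)';

    my $string = 'ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûüÝÿýÆæÞþÐðß';
    print deaccent($string);

    sub deaccent {
        my $phrase = shift;

        # Nothing to do if the string has no characters in the \xC0-\xFF range.
        return $phrase unless ($phrase =~ m/[\xC0-\xFF]/);

        # One-to-one replacements are handled by tr///.
        $phrase =~ tr/ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûüÝÿý/AAAAAAaaaaaaCcEEEEeeeeIIIIiiiiOOOOOOooooooNnUUUUuuuuYyy/;

        # One-to-many replacements need a substitution.
        my %trans = (
            'Æ' => 'AE', 'æ' => 'ae',
            'Þ' => 'TH', 'þ' => 'th',
            'Ð' => 'TH', 'ð' => 'th',
            'ß' => 'ss',
        );
        $phrase =~ s/([ÆæÞþÐðß])/$trans{$1}/g;

        return $phrase;
    }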

Benchmarking puts it at about 7 times the speed (615% faster in the run below). Moving the hash assignment outside the sub speeds both versions up by about the same amount, so the ratio between them stays roughly the same.

    use strict;
    use warnings;
    use utf8;    # the literals below are non-ASCII; assumes a UTF-8 source file
    use Benchmark qw( cmpthese );

    my $string = 'ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûüÝÿýÆæÞþÐðß';

    cmpthese( -5, {
        # tr/// for single letters, s/// only for the multi-letter cases.
        deaccent => sub {
            my $phrase = $string;
            return $phrase unless ($phrase =~ m/[\xC0-\xFF]/);
            $phrase =~ tr/ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûüÝÿý/AAAAAAaaaaaaCcEEEEeeeeIIIIiiiiOOOOOOooooooNnUUUUuuuuYyy/;
            my %trans = (
                'Æ' => 'AE', 'æ' => 'ae',
                'Þ' => 'TH', 'þ' => 'th',
                'Ð' => 'TH', 'ð' => 'th',
                'ß' => 'ss',
            );
            $phrase =~ s/([ÆæÞþÐðß])/$trans{$1}/g;
            return $phrase;
        },
        # Hash lookup on every character of the string.
        deaccent2 => sub {
            my %acc = qw(
                À A Á A Â A Ã A Ä A Å A Æ AE Ç C È E É E Ê E Ë E
                Ì I Í I Î I Ï I Ð TH Ñ N Ò O Ó O Ô O Õ O Ö O Ø O
                Ù U Ú U Û U Ü U Ý Y Þ TH ß ss
                à a á a â a ã a ä a å a æ ae ç c è e é e ê e ë e
                ì i í i î i ï i ð th ñ n ò o ó o ô o õ o ö o ø o
                ù u ú u û u ü u ý y þ th ÿ y
            );
            my $text = $string;
            $text =~ s/(.)/$acc{$1}?$acc{$1}:$1/eg;
            return $text;
        },
    });

Returns on my system:

             Rate deaccent2  deaccent
deaccent2  4316/s        --      -86%
deaccent  30859/s      615%        --
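
The hash-hoisting variant mentioned above could look something like the following. This is only a sketch: the names `%TRANS` and `deaccent_hoisted` are made up for illustration, and it assumes the source file is saved as UTF-8 (hence `use utf8`).

```perl
use strict;
use warnings;
use utf8;    # assumption: this file is saved as UTF-8
binmode STDOUT, ':encoding(UTF-8)';

# Built once at file scope instead of on every call to the sub.
my %TRANS = (
    'Æ' => 'AE', 'æ' => 'ae',
    'Þ' => 'TH', 'þ' => 'th',
    'Ð' => 'TH', 'ð' => 'th',
    'ß' => 'ss',
);

# Hypothetical name; same logic as deaccent, minus the per-call hash setup.
sub deaccent_hoisted {
    my $phrase = shift;

    # Short circuit: nothing in the \xC0-\xFF range, nothing to do.
    return $phrase unless $phrase =~ m/[\xC0-\xFF]/;

    # Single-letter replacements.
    $phrase =~ tr/ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûüÝÿý/AAAAAAaaaaaaCcEEEEeeeeIIIIiiiiOOOOOOooooooNnUUUUuuuuYyy/;

    # Multi-letter replacements, looked up in the shared hash.
    $phrase =~ s/([ÆæÞþÐðß])/$TRANS{$1}/g;

    return $phrase;
}

print deaccent_hoisted('Björk Guðmundsdóttir'), "\n";
```

The only change from the sub above is where the hash lives, so any speedup comes purely from not rebuilding seven key/value pairs per call.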

With data that has fewer accented characters, the gap should grow much wider: deaccent returns immediately (the m/[\xC0-\xFF]/ check) when there is nothing to transliterate, while deaccent2 still runs its substitution on every character.
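
One way to check that would be to rerun the comparison on input with no accented characters at all. The following is a rough sketch, not a measured result: the test strings `$plain` and `$accented` are invented for illustration, both hashes are kept at file scope to keep it short, and the file is assumed to be UTF-8.

```perl
use strict;
use warnings;
use utf8;    # assumption: this file is saved as UTF-8
use Benchmark qw( cmpthese );

my %trans = ( 'Æ' => 'AE', 'æ' => 'ae', 'Þ' => 'TH', 'þ' => 'th',
              'Ð' => 'TH', 'ð' => 'th', 'ß' => 'ss' );

my %acc = qw(
    À A Á A Â A Ã A Ä A Å A Æ AE Ç C È E É E Ê E Ë E
    Ì I Í I Î I Ï I Ð TH Ñ N Ò O Ó O Ô O Õ O Ö O Ø O
    Ù U Ú U Û U Ü U Ý Y Þ TH ß ss
    à a á a â a ã a ä a å a æ ae ç c è e é e ê e ë e
    ì i í i î i ï i ð th ñ n ò o ó o ô o õ o ö o ø o
    ù u ú u û u ü u ý y þ th ÿ y
);

# tr/// first, s/// only for the multi-letter cases.
sub deaccent {
    my $phrase = shift;
    return $phrase unless $phrase =~ m/[\xC0-\xFF]/;    # short circuit
    $phrase =~ tr/ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûüÝÿý/AAAAAAaaaaaaCcEEEEeeeeIIIIiiiiOOOOOOooooooNnUUUUuuuuYyy/;
    $phrase =~ s/([ÆæÞþÐðß])/$trans{$1}/g;
    return $phrase;
}

# Hash lookup for every character, accented or not.
sub deaccent2 {
    my $text = shift;
    $text =~ s/(.)/exists $acc{$1} ? $acc{$1} : $1/eg;
    return $text;
}

# Hypothetical inputs: one with no accented characters, one with a few.
my $plain    = 'The quick brown fox jumps over the lazy dog';
my $accented = 'Björk Guðmundsdóttir';

cmpthese( -1, {
    deaccent_plain  => sub { deaccent($plain) },
    deaccent2_plain => sub { deaccent2($plain) },
});
```

Swapping `$plain` for `$accented` in the two entries gives the mixed-input case for comparison.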


In reply to Re^2: The Björk Situation by thundergnat
in thread The Björk Situation by SheridanCat
