in reply to The Björk Situation

This is how I do that.

#!/bin/perl5
use strict;
use warnings;

# Assumes single-byte (Latin-1) source and text, so (.) sees one character at a time.
my %acc = get_accent();

# ...later
my $text = get_text();
# Replace each character that has an entry in %acc; leave the rest alone.
$text =~ s/(.)/$acc{$1}?$acc{$1}:$1/eg;

sub get_accent{
    return qw(
        À A   Á A   Â A   Ã A   Ä A   Å A   Æ AE  Ç C
        È E   É E   Ê E   Ë E   Ì I   Í I   Î I   Ï I
        Ð TH  Ñ N   Ò O   Ó O   Ô O   Õ O   Ö O   Ø O
        Ù U   Ú U   Û U   Ü U   Ý Y   Þ TH  ß ss
        à a   á a   â a   ã a   ä a   å a   æ ae  ç c
        è e   é e   ê e   ë e   ì i   í i   î i   ï i
        ð th  ñ n   ò o   ó o   ô o   õ o   ö o   ø o
        ù u   ú u   û u   ü u   ý y   þ th  ÿ y
    );
}

sub get_text{
    # get text :-)
}

Hope that helps.

Re^2: The Björk Situation
by thundergnat (Deacon) on Feb 15, 2006 at 19:02 UTC

    You can speed this up considerably by transliterating everything you can and then only substituting characters that need it.

    my $string = 'ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûüÝÿýÆæÞþÐðß';

    print deaccent($string);

    sub deaccent{
        my $phrase = shift;
        return $phrase unless ($phrase =~ m/[\xC0-\xFF]/);
        $phrase =~ tr/ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûüÝÿý/AAAAAAaaaaaaCcEEEEeeeeIIIIiiiiOOOOOOooooooNnUUUUuuuuYyy/;
        my %trans = (
            'Æ' => 'AE',
            'æ' => 'ae',
            'Þ' => 'TH',
            'þ' => 'th',
            'Ð' => 'TH',
            'ð' => 'th',
            'ß' => 'ss'
        );
        $phrase =~ s/([ÆæÞþÐðß])/$trans{$1}/g;
        return $phrase;
    }

    Benchmarking puts it at about 6 to 7 times the speed. Moving the hash assignment outside the sub speeds both up by about the same amount, so the ratio between them stays roughly the same.

    use Benchmark qw( cmpthese );

    my $string = 'ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûüÝÿýÆæÞþÐðß';

    cmpthese( -5, {
        # tr/// for the one-to-one cases, s/// only for the multi-letter ones
        deaccent => sub {
            my $phrase = $string;
            return $phrase unless ($phrase =~ m/[\xC0-\xFF]/);
            $phrase =~ tr/ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûüÝÿý/AAAAAAaaaaaaCcEEEEeeeeIIIIiiiiOOOOOOooooooNnUUUUuuuuYyy/;
            my %trans = (
                'Æ' => 'AE',
                'æ' => 'ae',
                'Þ' => 'TH',
                'þ' => 'th',
                'Ð' => 'TH',
                'ð' => 'th',
                'ß' => 'ss'
            );
            $phrase =~ s/([ÆæÞþÐðß])/$trans{$1}/g;
            return $phrase;
        },
        # hash lookup on every character
        deaccent2 => sub{
            my %acc = qw(
                À A   Á A   Â A   Ã A   Ä A   Å A   Æ AE  Ç C
                È E   É E   Ê E   Ë E   Ì I   Í I   Î I   Ï I
                Ð TH  Ñ N   Ò O   Ó O   Ô O   Õ O   Ö O   Ø O
                Ù U   Ú U   Û U   Ü U   Ý Y   Þ TH  ß ss
                à a   á a   â a   ã a   ä a   å a   æ ae  ç c
                è e   é e   ê e   ë e   ì i   í i   î i   ï i
                ð th  ñ n   ò o   ó o   ô o   õ o   ö o   ø o
                ù u   ú u   û u   ü u   ý y   þ th  ÿ y
            );
            my $text = $string;
            $text =~ s/(.)/$acc{$1}?$acc{$1}:$1/eg;
            return $text;
        },
    });

    Returns on my system:

                 Rate deaccent2  deaccent
    deaccent2  4316/s        --      -86%
    deaccent  30859/s      615%        --
    

    With data that has fewer accented characters, the disparity should grow much larger, since deaccent short-circuits when there are no characters to be transliterated.
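
    A minimal sketch of the "hash outside the sub" variant mentioned above. The names %TRANS, $MULTI and deaccent_hoisted are made up for illustration, and unlike the code above it assumes the source file is saved as UTF-8 (hence use utf8). It has not been benchmarked here.

    ```perl
    use strict;
    use warnings;
    use utf8;    # assumption: this file is UTF-8, not Latin-1

    # Built once at load time instead of on every call.
    my %TRANS = (
        'Æ' => 'AE', 'æ' => 'ae',
        'Þ' => 'TH', 'þ' => 'th',
        'Ð' => 'TH', 'ð' => 'th',
        'ß' => 'ss',
    );

    # Character class derived from the hash keys, so the two cannot drift apart.
    my $MULTI = do { my $class = join '', keys %TRANS; qr/([$class])/ };

    sub deaccent_hoisted {
        my $phrase = shift;
        # Short circuit: nothing in the Latin-1 accented range, nothing to do.
        return $phrase unless $phrase =~ /[\xC0-\xFF]/;
        # One-to-one replacements go through tr///, which is cheap.
        $phrase =~ tr/ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûüÝÿý/AAAAAAaaaaaaCcEEEEeeeeIIIIiiiiOOOOOOooooooNnUUUUuuuuYyy/;
        # Only the one-to-many replacements need the hash.
        $phrase =~ s/$MULTI/$TRANS{$1}/g;
        return $phrase;
    }

    print deaccent_hoisted('Björk Guðmundsdóttir'), "\n";   # Bjork Guthmundsdottir
    ```

    Deriving the character class from the keys is a design choice, not a speed trick: adding a pair to %TRANS is then the only change needed.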

      I thought I'd add Text::Unidecode in the mix:
      use Text::Unidecode;
      ...
      unidecode => sub { return unidecode($string) },
      The benchmark returns this on my system:
                    Rate deaccent2  deaccent unidecode
      deaccent2   8614/s        --      -83%      -97%
      deaccent   50243/s      483%        --      -81%
      unidecode 267338/s     3003%      432%        --

        Actually, now that I've had a moment to look at it, unidecode DOESN'T fare so well, strictly from a speed point of view.

        You made the mistake of modifying $string directly so that in all but the first call, there are NO characters that need to be transliterated so it benchmarked much faster. Once that is fixed, it doesn't have such a big lead. (Actually, none at all ;-) )

        unidecode => sub{ my $text = $string; return unidecode($text); },
        Yields:
        
                     Rate unidecode deaccent2  deaccent
        unidecode  6797/s        --       -3%      -87%
        deaccent2  6979/s        3%        --      -86%
        deaccent  50687/s      646%      626%        --
        

        Nevertheless, unidecode probably IS the best choice, as it handles Unicode up to \xFFFF, not just up to \xFF.

        Good point. Though Text::Unidecode transliterates eth (ð) as d rather than the more generally accepted th. That's just quibbling though. You really shouldn't be using ANY of these functions lightly, since they destroy information and change the meaning of the text.
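
        For characters beyond \xFF, one alternative that needs no lookup table is to decompose the text and drop the combining marks. This is a different technique from anything posted above and is not how Text::Unidecode works; it is only a rough sketch using the core Unicode::Normalize module. The name strip_marks is made up, and the input is assumed to be an already decoded character string.

        ```perl
        use strict;
        use warnings;
        use utf8;    # assumption: this file is UTF-8
        use Unicode::Normalize qw(NFD);

        sub strip_marks {
            my $text = NFD(shift);    # split e.g. "ö" into "o" plus U+0308
            $text =~ s/\p{Mn}//g;     # drop the nonspacing combining marks
            return $text;
        }

        binmode STDOUT, ':encoding(UTF-8)';
        print strip_marks('Škoda naïve ø'), "\n";   # Skoda naive ø
        ```

        Note that ø comes through unchanged. Letters such as ø, æ, ð and ß have no canonical decomposition, so they would still need an explicit table like the ones above.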

Re^2: The Björk Situation
by DrHyde (Prior) on Feb 16, 2006 at 09:42 UTC
    This problem is a lot harder than you think. æ sometimes becomes ae, sometimes becomes e, and sometimes becomes a. Which one is correct depends on the circumstances. For example, "encyclopædia" is normally written "encyclopaedia" in English and "encyclopedia" in American. And the name "Ælfred" is now normally written "Alfred".

    You also forgot about œ, the crossed-out l (ł) that is used in Polish, the dotless i (ı) from Turkish, and no doubt others that I can't think of right now.

      I agree. Many letters have more than one de-accentized version. For example, ö can be rewritten as o or oe, depending on which language it came from and which medium is used.
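
      One way to cope with that is a default table plus per-language overrides. The sketch below is only an illustration: %DEFAULT, %BY_LANG, deaccent_for and the language codes are all hypothetical, the tables are nowhere near complete, and it does nothing for cases like encyclopædia or Ælfred that depend on the word rather than the language.

      ```perl
      use strict;
      use warnings;
      use utf8;    # assumption: this file is UTF-8

      # Fallback used when the language has no opinion (illustrative, incomplete).
      my %DEFAULT = (
          'ö' => 'o',  'Ö' => 'O',
          'æ' => 'ae', 'Æ' => 'AE',
          'œ' => 'oe', 'ł' => 'l', 'ı' => 'i',
      );

      # Per-language overrides; only German is filled in here.
      my %BY_LANG = (
          de => { 'ö' => 'oe', 'Ö' => 'Oe' },
      );

      sub deaccent_for {
          my ($text, $lang) = @_;
          my $over = (defined($lang) && $BY_LANG{$lang}) ? $BY_LANG{$lang} : {};
          my %map  = ( %DEFAULT, %$over );    # overrides win over defaults
          $text =~ s/(.)/exists $map{$1} ? $map{$1} : $1/ge;
          return $text;
      }

      print deaccent_for('Köln', 'de'), "\n";   # Koeln
      print deaccent_for('Björk'), "\n";        # Bjork
      ```

      The caller has to know the language; nothing here tries to guess it from the text.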