Re: RFC: How to unaccent text?

Looking at the Text::StripAccents source code, it seems quite inefficient: it splits the string into characters, loops over them replacing the accented ones with their ASCII equivalents, and then joins the string again.

A regular expression substitution could do it:

my %table = ( 'À' => 'A', 'Á' => 'A', 'Â' => 'A', 'Ã' => 'A', 'Ä' => 'A', 'Å' => 'A',
              'Ç' => 'C',
              'È' => 'E', 'É' => 'E', 'Ê' => 'E', 'Ë' => 'E',
              'Ì' => 'I', 'Í' => 'I', 'Î' => 'I', 'Ï' => 'I',
              'Ñ' => 'N',
              'Ò' => 'O', 'Ó' => 'O', 'Ô' => 'O', 'Õ' => 'O', 'Ö' => 'O',
              'Ù' => 'U', 'Ú' => 'U', 'Û' => 'U', 'Ü' => 'U',
              'à' => 'a', 'á' => 'a', 'â' => 'a', 'ã' => 'a', 'ä' => 'a', 'å' => 'a',
              'ç' => 'c',
              'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
              'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i',
              'ñ' => 'n',
              'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'õ' => 'o', 'ö' => 'o',
              'ß' => 'ss',
              'ù' => 'u', 'ú' => 'u', 'û' => 'u', 'ü' => 'u',
              'ý' => 'y' );

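# Replace every non-ASCII character with its table entry, or '?' if unmapped.
# Assumes the table keys and $str use the same representation, with one
# character per accented letter (e.g. decoded text under "use utf8").
sub strip_accents {
    my $str = shift;
    $str =~ s/([^\x00-\x7F])/$table{$1} || '?'/ge;
    return $str;
}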

It's so simple that it makes me wonder whether a module is actually required...

And BTW, "unaccenting" characters is not a unique transformation: it depends on the language of the text. For instance, in German 'ü' should be mapped to 'ue' (see Lingua::DE::ASCII), but in Spanish it should be mapped to 'u'.
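To make the language dependence concrete, here is a minimal sketch that layers per-language overrides on top of a shared table. The names `%base`, `%override` and `strip_accents_for` are hypothetical, the tables are deliberately tiny, and it assumes the input is already a decoded character string; it is an illustration, not how Lingua::DE::ASCII or Text::StripAccents work.

```perl
use strict;
use warnings;
use utf8;    # this source file contains non-ASCII literals

# Hypothetical shared defaults (a small excerpt, not a complete table).
my %base = (
    'á' => 'a', 'é' => 'e', 'í' => 'i', 'ó' => 'o', 'ú' => 'u',
    'ä' => 'a', 'ö' => 'o', 'ü' => 'u', 'ñ' => 'n',
);

# Hypothetical per-language overrides; 'es' just uses the defaults.
my %override = (
    de => { 'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue', 'ß' => 'ss' },
    es => {},
);

sub strip_accents_for {
    my ($lang, $str) = @_;
    # Later pairs win, so language-specific entries replace the defaults.
    my %table = ( %base, %{ $override{$lang} || {} } );
    $str =~ s/([^\x00-\x7F])/exists $table{$1} ? $table{$1} : '?'/ge;
    return $str;
}

print strip_accents_for('de', 'für'),      "\n";   # fuer
print strip_accents_for('es', 'pingüino'), "\n";   # pinguino
```

The same input character gives different output depending on the language code, which is exactly why a single fixed table cannot be right for everybody.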

Re^2: RFC: How to unaccent text? by bart (Canon) on Apr 11, 2007 at 10:32 UTC
"Looking at the Text::StripAccents source code, it seems quite inefficient: it splits the string into characters, loops over them replacing the accented ones with their ASCII equivalents, and then joins the string again."

Ouch. That sounds to me like it could be improved, probably without changing the API. So it could be better in the next version... (if somebody lends the author a hand. It could be you.)

"It's so simple that it makes me wonder whether a module is actually required..."

What about the data table... Are you going to construct it by hand every time? Or will you be using copy-and-paste? Make it a module; it's the perfect place for it.

P.S. I suppose `tr///` would be a lot more efficient than `s///`, at least for single-character replacements. You might benchmark it to compare.
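A rough sketch of such a comparison, using the core Benchmark module. The helper names `via_subst` and `via_tr`, the five-entry table and the sample string are made up for illustration, and timings will vary by Perl version and input. Note that `tr///` only handles one-to-one mappings, so an entry like 'ß' => 'ss' would still need `s///`.

```perl
use strict;
use warnings;
use utf8;    # this source file contains non-ASCII literals
use Benchmark qw(cmpthese);

# Small illustrative table; a real one would be much larger.
my %table = ( 'á' => 'a', 'é' => 'e', 'í' => 'i', 'ó' => 'o', 'ú' => 'u' );

# Hash lookup per non-ASCII character, '?' for anything unmapped.
sub via_subst {
    my $str = shift;
    $str =~ s/([^\x00-\x7F])/exists $table{$1} ? $table{$1} : '?'/ge;
    return $str;
}

# Same result with two transliterations: map the known characters,
# then turn every remaining non-ASCII character into '?'.
sub via_tr {
    my $str = shift;
    $str =~ tr/áéíóú/aeiou/;
    $str =~ tr/\x00-\x7F/?/c;
    return $str;
}

my $sample = 'áéíóú plain ASCII text ' x 50;

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    's///'  => sub { via_subst($sample) },
    'tr///' => sub { via_tr($sample) },
});
```

Both subs should return the same string for the same input, so whatever difference shows up in the table comes from the mechanism rather than from the mapping.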
Re^3: RFC: How to unaccent text? by salva (Canon) on Apr 11, 2007 at 10:55 UTC
"What about the data table... Are you going to construct it by hand every time? Or will you be using copy-and-paste?"

Well, as I pointed out in my previous reply, the transformation is not unique: there could be several variations, and including the table in the code is an easy way to ensure that the right one is used. For instance, Text::StripAccents converts 'ß' to 'ss', something unexpected for a Spanish user like me.

IMO, the right solution would be to create a set of language-dependent modules similar to Lingua::DE::ASCII.