PerlMonks  

The Björk Situation

by SheridanCat (Pilgrim)
on Feb 15, 2006 at 17:36 UTC [id://530454]

SheridanCat has asked for the wisdom of the Perl Monks concerning the following question:

I've done some CPAN searching to no avail. I have to think I'm just not looking for the right thing, so perhaps someone can point me in the right direction.

I have a search requirement that someone be able to look up a musician's name either by its native spelling or by its Americanized equivalent. The example we tend to use is Björk. You should be able to look her up with that spelling (o with the umlaut) or as "bjork".

The search engine works fine and all is well. I just need to generate a keyword with the Americanized spelling so it can be indexed.

Does anyone know of an existing CPAN library that will help with this? Worst case is I build a hash lookup, but I'm lazy.

Thanks, Troy

Update:

Many thanks to those who have responded. This is very helpful all around.

Troy

Replies are listed 'Best First'.
Re: The Björk Situation
by friedo (Prior) on Feb 15, 2006 at 17:39 UTC
    For a similar problem I use Text::Unaccent which uses the useful iconv utility available on most OSes.
Re: The Björk Situation
by rhesa (Vicar) on Feb 15, 2006 at 18:44 UTC
Re: The Björk Situation
by wfsp (Abbot) on Feb 15, 2006 at 18:11 UTC
    This is how I do that.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %acc = get_accent();

    # ...later
    my $text = get_text();
    $text =~ s/(.)/$acc{$1} ? $acc{$1} : $1/eg;

    sub get_accent {
        return qw(
            À A  Á A  Â A  Ã A  Ä A  Å A  Æ AE Ç C
            È E  É E  Ê E  Ë E  Ì I  Í I  Î I  Ï I
            Ð TH Ñ N  Ò O  Ó O  Ô O  Õ O  Ö O  Ø O
            Ù U  Ú U  Û U  Ü U  Ý Y  Þ TH ß ss
            à a  á a  â a  ã a  ä a  å a  æ ae ç c
            è e  é e  ê e  ë e  ì i  í i  î i  ï i
            ð th ñ n  ò o  ó o  ô o  õ o  ö o  ø o
            ù u  ú u  û u  ü u  ý y  þ th ÿ y
        );
    }

    sub get_text {
        # get text :-)
    }

    Hope that helps.

      You can speed this up considerably by transliterating everything you can and then only substituting characters that need it.

      my $string = 'ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûüÝÿýÆæÞþÐðß';

      print deaccent($string);

      sub deaccent {
          my $phrase = shift;
          return $phrase unless $phrase =~ m/[\xC0-\xFF]/;
          $phrase =~ tr/ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûüÝÿý/AAAAAAaaaaaaCcEEEEeeeeIIIIiiiiOOOOOOooooooNnUUUUuuuuYyy/;
          my %trans = (
              'Æ' => 'AE', 'æ' => 'ae',
              'Þ' => 'TH', 'þ' => 'th',
              'Ð' => 'TH', 'ð' => 'th',
              'ß' => 'ss',
          );
          $phrase =~ s/([ÆæÞþÐðß])/$trans{$1}/g;
          return $phrase;
      }

      Benchmarking puts it at about six times the speed. Moving the hash assignment outside the sub speeds both up by about the same amount; the ratio stays about 6:1.

      use Benchmark qw( cmpthese );

      my $string = 'ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûüÝÿýÆæÞþÐðß';

      cmpthese( -5, {
          deaccent => sub {
              my $phrase = $string;
              return $phrase unless $phrase =~ m/[\xC0-\xFF]/;
              $phrase =~ tr/ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûüÝÿý/AAAAAAaaaaaaCcEEEEeeeeIIIIiiiiOOOOOOooooooNnUUUUuuuuYyy/;
              my %trans = (
                  'Æ' => 'AE', 'æ' => 'ae',
                  'Þ' => 'TH', 'þ' => 'th',
                  'Ð' => 'TH', 'ð' => 'th',
                  'ß' => 'ss',
              );
              $phrase =~ s/([ÆæÞþÐðß])/$trans{$1}/g;
              return $phrase;
          },
          deaccent2 => sub {
              my %acc = qw(
                  À A  Á A  Â A  Ã A  Ä A  Å A  Æ AE Ç C
                  È E  É E  Ê E  Ë E  Ì I  Í I  Î I  Ï I
                  Ð TH Ñ N  Ò O  Ó O  Ô O  Õ O  Ö O  Ø O
                  Ù U  Ú U  Û U  Ü U  Ý Y  Þ TH ß ss
                  à a  á a  â a  ã a  ä a  å a  æ ae ç c
                  è e  é e  ê e  ë e  ì i  í i  î i  ï i
                  ð th ñ n  ò o  ó o  ô o  õ o  ö o  ø o
                  ù u  ú u  û u  ü u  ý y  þ th ÿ y
              );
              my $text = $string;
              $text =~ s/(.)/$acc{$1} ? $acc{$1} : $1/eg;
              return $text;
          },
      } );

      Returns on my system:

                   Rate deaccent2  deaccent
      deaccent2  4316/s        --      -86%
      deaccent  30859/s      615%        --
      

      With data that has fewer accented characters, the disparity should grow even larger, since deaccent short-circuits and returns immediately when there is nothing to transliterate.
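      The transliterate-then-expand idea above is not Perl-specific. Here is a hypothetical Python port (illustrative only, not from this thread); `str.translate` accepts whole strings as replacement values, so the one-to-many cases (AE, TH, ss) fit in the same table and no second substitution pass is needed:

```python
# Hypothetical Python port of the two-step fold above (illustrative,
# not from the thread). str.translate maps code points to strings,
# so one-to-many expansions need no separate regex pass.
FOLD = {}
for chars, repl in [
    ('ÀÁÂÃÄÅ', 'A'), ('àáâãäå', 'a'), ('Ç', 'C'), ('ç', 'c'),
    ('ÈÉÊË', 'E'), ('èéêë', 'e'), ('ÌÍÎÏ', 'I'), ('ìíîï', 'i'),
    ('ÒÓÔÕÖØ', 'O'), ('òóôõöø', 'o'), ('Ñ', 'N'), ('ñ', 'n'),
    ('ÙÚÛÜ', 'U'), ('ùúûü', 'u'), ('Ý', 'Y'), ('ýÿ', 'y'),
    ('Æ', 'AE'), ('æ', 'ae'), ('Þ', 'TH'), ('þ', 'th'),
    ('Ð', 'TH'), ('ð', 'th'), ('ß', 'ss'),
]:
    for ch in chars:
        FOLD[ord(ch)] = repl

def deaccent(phrase):
    # Single pass over the string; unmapped characters pass through.
    return phrase.translate(FOLD)

print(deaccent('Björk'))   # Bjork
print(deaccent('straße'))  # strasse
```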

        I thought I'd add Text::Unidecode in the mix:
        use Text::Unidecode;
        ...
        unidecode => sub { return unidecode($string) },
        The benchmark returns this on my system:
                        Rate deaccent2  deaccent unidecode
        deaccent2     8614/s        --      -83%      -97%
        deaccent     50243/s      483%        --      -81%
        unidecode   267338/s     3003%      432%        --
      This problem is a lot harder than you think. æ sometimes becomes ae, sometimes becomes e, and sometimes becomes a; which one is correct depends on the circumstances. For example, "encyclopædia" is normally written "encyclopaedia" in British English and "encyclopedia" in American English. And the name "Ælfred" is now normally written "Alfred".

      You also forgot about œ, the stroked l (ł) used in Polish, the dotless i (ı) from Turkish, and no doubt others that I can't think of right now.

        I agree. Many letters have more than one de-accented version. For example, ö can be rewritten as o or oe depending on the language it came from and the medium being used.
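        That point can be sketched in a few lines of Python (the tables and names here are illustrative, not from the thread): a generic fold and a German-aware fold give different answers for the same input, so the "right" mapping has to be chosen per language.

```python
# Illustrative sketch: the correct folding depends on the source
# language, so a per-language exception table sits on top of a
# generic default.
DEFAULT = {'ä': 'a', 'ö': 'o', 'ü': 'u', 'ß': 'ss'}
GERMAN  = {'ä': 'ae', 'ö': 'oe', 'ü': 'ue', 'ß': 'ss'}

def fold(s, table):
    # Replace each character via the table; others pass through.
    return ''.join(table.get(c, c) for c in s)

print(fold('Schröder', DEFAULT))  # Schroder
print(fold('Schröder', GERMAN))   # Schroeder
```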

Re: The Björk Situation
by mattr (Curate) on Feb 17, 2006 at 11:59 UTC
    This week this subject came up on the wxperl list. A German programmer is going to use the word "farb" (color) for a color chooser he's making. There is already a Wx::ColourDialog, and wxWidgets seems to use the European spelling "colour" throughout; CPAN, however, seems to use both color and colour at random. Perhaps search.cpan.org ought to take care of this sort of problem automatically?
      I await your proposal for how to automate this with great interest. I will be particularly interested to see how you automagically deal with modules which provide a correctly spelt method and an incorrectly spelt alias for that method, and modules which provide neither, instead using clr just to stop idiots from whining.
        Not sure if you are being tongue-in-cheek, but... For example, with htdig and other search engines there is a plugin that allows words to be slightly "misspelled": a one-character misspelling would match both color and colour. I don't remember exactly, but it seems there was also a list of commonly misspelled words; if not, one could be made.

        Anyway, as you suggest, it is impossible to solve every case. However, please note these are things being created by programmers to do common things; it is not brain surgery. It is also an old problem: for example, BSD's spell command apparently has a -d option to specify a hash of alternate spellings (the hlist file).

        Also, guess what: dictionaries are published by people who study this sort of thing for a living, and they usually note alternative spellings. Even WordNet, which has a Perl module, has both color and colour. So between currently available resources and software to be developed, it does not in fact seem to be as daunting a task as all that. Not that I'm going to do it, though! :)

Re: The Björk Situation
by brycen (Monk) on Oct 19, 2010 at 03:51 UTC
    Try this clip, if your code is in Unicode-land:
    # Function: translate_diacriticals()
    #
    # Remove diacritical marks (e.g. ümlauts, Hebrew vowels, etc.)
    # for use in fuzzy matches, or for avoiding excess information loss
    # when encoding to restricted character sets like ASCII.
    #
    # See also:
    #   http://www.perlmonks.org/?node_id=835238
    #   http://en.wikipedia.org/wiki/Diacritic
    #   http://en.wikipedia.org/wiki/Unicode_equivalence
    #   http://unicode.org/reports/tr15/
    #   http://www.faqs.org/rfcs/rfc3454.html
    #
    use Unicode::Normalize;

    sub translate_diacriticals($) {
        my $str = Unicode::Normalize::NFKD($_[0]);
        $str =~ s/\p{NonspacingMark}//g;
        return $str;
    }
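    For readers outside Perl, the same recipe can be sketched with Python's stdlib unicodedata module (a hypothetical rendering, not part of the original clip). Note the K in NFKD: the compatibility form also folds ligatures and similar characters, which plain NFD would leave alone, and characters with no decomposition at all (æ, ß) survive untouched, as discussed above.

```python
import unicodedata

def translate_diacriticals(s):
    # Same recipe as the Perl clip: NFKD decomposition, then drop
    # nonspacing marks (category Mn): ö -> o + U+0308 -> o.
    decomposed = unicodedata.normalize('NFKD', s)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')

print(translate_diacriticals('Björk'))  # Bjork
print(translate_diacriticals('ﬁle'))    # file (ligature folded by NFKD)
```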

Node Type: perlquestion [id://530454]
Approved by friedo
Front-paged by Old_Gray_Bear