PerlMonks  

The Björk Situation

by SheridanCat (Pilgrim)
on Feb 15, 2006 at 17:36 UTC [id://530454]

SheridanCat has asked for the wisdom of the Perl Monks concerning the following question:

I've done some CPAN searching to no avail. I have to think I'm just not looking for the right thing, so perhaps someone can point me in the right direction.

I have a search requirement that someone be able to look up a musician's name either by its native spelling or by its Americanized equivalent. The example we tend to use is Björk. You should be able to look her up with that spelling (o with the umlaut) or as "bjork".

The search engine works fine and all is well. I just need to generate a keyword with the Americanized spelling so it can be indexed.

Does anyone know of an existing CPAN library that will help with this? Worst case is I build a hash lookup, but I'm lazy.

Thanks, Troy

Update:

Many thanks to those who have responded. This is very helpful all around.

Troy

Replies are listed 'Best First'.
Re: The Björk Situation
by friedo (Prior) on Feb 15, 2006 at 17:39 UTC
    For a similar problem I use Text::Unaccent which uses the useful iconv utility available on most OSes.
Re: The Björk Situation
by rhesa (Vicar) on Feb 15, 2006 at 18:44 UTC
Re: The Björk Situation
by wfsp (Abbot) on Feb 15, 2006 at 18:11 UTC
    This is how I do that.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %acc = get_accent();

    # ...later
    my $text = get_text();
    $text =~ s/(.)/$acc{$1} ? $acc{$1} : $1/eg;

    sub get_accent {
        return qw(
            À A  Á A  Â A  Ã A  Ä A  Å A  Æ AE Ç C
            È E  É E  Ê E  Ë E  Ì I  Í I  Î I  Ï I
            Ð TH Ñ N  Ò O  Ó O  Ô O  Õ O  Ö O  Ø O
            Ù U  Ú U  Û U  Ü U  Ý Y  Þ TH ß ss
            à a  á a  â a  ã a  ä a  å a  æ ae ç c
            è e  é e  ê e  ë e  ì i  í i  î i  ï i
            ð th ñ n  ò o  ó o  ô o  õ o  ö o  ø o
            ù u  ú u  û u  ü u  ý y  þ th ÿ y
        );
    }

    sub get_text {
        # get text :-)
    }

    Hope that helps.

      You can speed this up considerably by transliterating everything you can and then only substituting characters that need it.

      my $string = 'ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûüÝÿýÆæÞþÐðß';

      print deaccent($string);

      sub deaccent {
          my $phrase = shift;
          return $phrase unless $phrase =~ m/[\xC0-\xFF]/;
          $phrase =~ tr/ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûüÝÿý/AAAAAAaaaaaaCcEEEEeeeeIIIIiiiiOOOOOOooooooNnUUUUuuuuYyy/;
          my %trans = (
              'Æ' => 'AE', 'æ' => 'ae',
              'Þ' => 'TH', 'þ' => 'th',
              'Ð' => 'TH', 'ð' => 'th',
              'ß' => 'ss',
          );
          $phrase =~ s/([ÆæÞþÐðß])/$trans{$1}/g;
          return $phrase;
      }

      Benchmarking puts it at about six times the speed. Moving the hash assignment outside the sub speeds both up by about the same amount; the ratio stays about 6:1.

      use Benchmark qw( cmpthese );

      my $string = 'ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûüÝÿýÆæÞþÐðß';

      cmpthese( -5, {
          deaccent => sub {
              my $phrase = $string;
              return $phrase unless $phrase =~ m/[\xC0-\xFF]/;
              $phrase =~ tr/ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûüÝÿý/AAAAAAaaaaaaCcEEEEeeeeIIIIiiiiOOOOOOooooooNnUUUUuuuuYyy/;
              my %trans = (
                  'Æ' => 'AE', 'æ' => 'ae',
                  'Þ' => 'TH', 'þ' => 'th',
                  'Ð' => 'TH', 'ð' => 'th',
                  'ß' => 'ss',
              );
              $phrase =~ s/([ÆæÞþÐðß])/$trans{$1}/g;
              return $phrase;
          },
          deaccent2 => sub {
              my %acc = qw(
                  À A  Á A  Â A  Ã A  Ä A  Å A  Æ AE Ç C
                  È E  É E  Ê E  Ë E  Ì I  Í I  Î I  Ï I
                  Ð TH Ñ N  Ò O  Ó O  Ô O  Õ O  Ö O  Ø O
                  Ù U  Ú U  Û U  Ü U  Ý Y  Þ TH ß ss
                  à a  á a  â a  ã a  ä a  å a  æ ae ç c
                  è e  é e  ê e  ë e  ì i  í i  î i  ï i
                  ð th ñ n  ò o  ó o  ô o  õ o  ö o  ø o
                  ù u  ú u  û u  ü u  ý y  þ th ÿ y
              );
              my $text = $string;
              $text =~ s/(.)/$acc{$1} ? $acc{$1} : $1/eg;
              return $text;
          },
      } );

      Returns on my system:

                   Rate deaccent2  deaccent
      deaccent2  4316/s        --      -86%
      deaccent  30859/s      615%        --
      

      With data that has fewer accented characters, the disparity should grow even larger, since deaccent short-circuits and returns immediately when there is nothing to transliterate.
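      The transliterate-then-expand idea above is not Perl-specific. Here is a hypothetical Python port (illustrative only, not from this thread); `str.translate` accepts whole strings as replacement values, so the one-to-many cases (AE, TH, ss) fit in the same table and no second substitution pass is needed:

```python
# Hypothetical Python port of the two-step fold above (illustrative,
# not from the thread). str.translate maps code points to strings,
# so one-to-many expansions need no separate regex pass.
FOLD = {}
for chars, repl in [
    ('ÀÁÂÃÄÅ', 'A'), ('àáâãäå', 'a'), ('Ç', 'C'), ('ç', 'c'),
    ('ÈÉÊË', 'E'), ('èéêë', 'e'), ('ÌÍÎÏ', 'I'), ('ìíîï', 'i'),
    ('ÒÓÔÕÖØ', 'O'), ('òóôõöø', 'o'), ('Ñ', 'N'), ('ñ', 'n'),
    ('ÙÚÛÜ', 'U'), ('ùúûü', 'u'), ('Ý', 'Y'), ('ýÿ', 'y'),
    ('Æ', 'AE'), ('æ', 'ae'), ('Þ', 'TH'), ('þ', 'th'),
    ('Ð', 'TH'), ('ð', 'th'), ('ß', 'ss'),
]:
    for ch in chars:
        FOLD[ord(ch)] = repl

def deaccent(phrase):
    # Single pass over the string; unmapped characters pass through.
    return phrase.translate(FOLD)

print(deaccent('Björk'))   # Bjork
print(deaccent('straße'))  # strasse
```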

        I thought I'd add Text::Unidecode in the mix:
        use Text::Unidecode;
        ...
        unidecode => sub { return unidecode($string) },
        The benchmark returns this on my system:
                        Rate deaccent2  deaccent unidecode
        deaccent2     8614/s        --      -83%      -97%
        deaccent     50243/s      483%        --      -81%
        unidecode   267338/s     3003%      432%        --
      This problem is a lot harder than you think. æ sometimes becomes ae, sometimes becomes e, and sometimes becomes a; which one is correct depends on the circumstances. For example, "encyclopædia" is normally written "encyclopaedia" in British English and "encyclopedia" in American English. And the name "Ælfred" is now normally written "Alfred".

      You also forgot about œ, the stroked l (ł) used in Polish, the dotless i (ı) from Turkish, and no doubt others that I can't think of right now.

        I agree. Many letters have more than one de-accented version. For example, ö can be rewritten as o or oe depending on the language it came from and the medium being used.
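        That point can be sketched in a few lines of Python (the tables and names here are illustrative, not from the thread): a generic fold and a German-aware fold give different answers for the same input, so the "right" mapping has to be chosen per language.

```python
# Illustrative sketch: the correct folding depends on the source
# language, so a per-language exception table sits on top of a
# generic default.
DEFAULT = {'ä': 'a', 'ö': 'o', 'ü': 'u', 'ß': 'ss'}
GERMAN  = {'ä': 'ae', 'ö': 'oe', 'ü': 'ue', 'ß': 'ss'}

def fold(s, table):
    # Replace each character via the table; others pass through.
    return ''.join(table.get(c, c) for c in s)

print(fold('Schröder', DEFAULT))  # Schroder
print(fold('Schröder', GERMAN))   # Schroeder
```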

Re: The Björk Situation
by mattr (Curate) on Feb 17, 2006 at 11:59 UTC
    This week this subject came up on the wxperl list. A German programmer is going to use the word "farb" (color) for a color chooser he's making. There is already a Wx::ColourDialog, and wxWidgets seems to use the European spelling "colour" throughout; CPAN, however, seems to use both color and colour at random. Perhaps search.cpan.org ought to take care of this sort of problem automatically?
      I await your proposal for how to automate this with great interest. I will be particularly interested to see how you automagically deal with modules which provide a correctly spelt method and an incorrectly spelt alias for that method, and modules which provide neither, instead using clr just to stop idiots from whining.
        Not sure if you are being tongue-in-cheek, but... For example, with htdig and other search engines there is a plugin that allows words to be slightly "misspelled": a one-character misspelling would match both color and colour. I don't remember exactly, but it seems there was also a list of commonly misspelled words; if not, one could be made.

        Anyway, as you suggest, it is impossible to solve every case. However, please note these are things being created by programmers to do common things; it is not brain surgery. It is also an old problem: for example, BSD's spell command apparently has a -d option to specify a hash of alternate spellings (the hlist file).

        Also, guess what: dictionaries are published by people who study this sort of thing for a living, and they usually note alternative spellings. Even WordNet, which has a Perl module, has both color and colour. So between currently available resources and software to be developed, it does not in fact seem to be as daunting a task as all that. Not that I'm going to do it, though! :)

Re: The Björk Situation
by brycen (Monk) on Oct 19, 2010 at 03:51 UTC
    Try this clip, if your code is in Unicode-land:
    # Function: translate_diacriticals()
    #
    # Remove diacritical marks (e.g. ümlauts, Hebrew vowels, etc.)
    # for use in fuzzy matches, or for avoiding excess information loss
    # when encoding to restricted character sets like ASCII.
    #
    # See also:
    #   http://www.perlmonks.org/?node_id=835238
    #   http://en.wikipedia.org/wiki/Diacritic
    #   http://en.wikipedia.org/wiki/Unicode_equivalence
    #   http://unicode.org/reports/tr15/
    #   http://www.faqs.org/rfcs/rfc3454.html
    #
    use Unicode::Normalize;

    sub translate_diacriticals($) {
        my $str = Unicode::Normalize::NFKD($_[0]);
        $str =~ s/\p{NonspacingMark}//g;
        return $str;
    }
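    For readers outside Perl, the same recipe can be sketched with Python's stdlib unicodedata module (a hypothetical rendering, not part of the original clip). Note the K in NFKD: the compatibility form also folds ligatures and similar characters, which plain NFD would leave alone, and characters with no decomposition at all (æ, ß) survive untouched, as discussed above.

```python
import unicodedata

def translate_diacriticals(s):
    # Same recipe as the Perl clip: NFKD decomposition, then drop
    # nonspacing marks (category Mn): ö -> o + U+0308 -> o.
    decomposed = unicodedata.normalize('NFKD', s)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')

print(translate_diacriticals('Björk'))  # Bjork
print(translate_diacriticals('ﬁle'))    # file (ligature folded by NFKD)
```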

Node Type: perlquestion [id://530454]
Approved by friedo
Front-paged by Old_Gray_Bear