Re^3: The Björk Situation

Re^4: The Björk Situation by thundergnat (Deacon) on Feb 16, 2006 at 00:09 UTC
Actually, now that I've had a moment to look at it, unidecode DOESN'T fare so well, strictly from a speed point of view. You made the mistake of modifying $string directly, so in all but the first call there were NO characters left to transliterate, and it benchmarked much faster. Once that is fixed, it doesn't have such a big lead. (Actually, none at all ;-) )

`unidecode => sub{ my $text = $string; return unidecode($text); },`

Yields:

             Rate unidecode deaccent2 deaccent
unidecode  6797/s        --       -3%     -87%
deaccent2  6979/s        3%        --     -86%
deaccent  50687/s      646%      626%       --

Nevertheless, unidecode probably IS the best choice, as it handles Unicode up to \xFFFF, not just up to \xFF.
Re^5: The Björk Situation by rhesa (Vicar) on Feb 16, 2006 at 00:30 UTC
"You made the mistake of modifying $string directly so that in all but the first call, there are NO characters that need to be transliterated so it is much faster. Once that is fixed, it doesn't have such a big lead."

Whoops! You're right, I hadn't expected it to modify $string in-place. I suppose that's due to Benchmark imposing a void context on the return.

My lesson learned today: Never trust your own benchmarks :)
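For anyone who wants to see the void-context effect in isolation, here is a minimal sketch using core Perl only. Note that `fold_accents` is a hypothetical stand-in written for this illustration, not Text::Unidecode itself; it only mimics the behaviour described above (edit the argument in place when called in void context, return a changed copy otherwise). The two closures are called in void context by hand, on the assumption that this is what the benchmark harness does.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a transliterator (NOT Text::Unidecode).
# In void context it edits the caller's variable through the @_ alias;
# in scalar or list context it works on a copy and returns it.
sub fold_accents {
    if (defined wantarray) {
        my $copy = $_[0];
        $copy =~ tr/\x{F6}\x{E9}/oe/;    # o-umlaut => o, e-acute => e
        return $copy;
    }
    $_[0] =~ tr/\x{F6}\x{E9}/oe/;        # void context: modify in place
    return;
}

my $string = "Bj\x{F6}rk";

# The last statement of a sub inherits the context the sub was called in,
# so calling these closures in void context puts fold_accents in void
# context as well.
my $buggy = sub { fold_accents($string) };
my $fixed = sub { my $text = $string; return fold_accents($text) };

$fixed->();    # only the private copy $text is altered
print "after fixed: ", ($string eq "Bj\x{F6}rk" ? "unchanged" : "changed"), "\n";

$buggy->();    # $string itself is altered, so later runs have nothing to do
print "after buggy: ", ($string eq "Bj\x{F6}rk" ? "unchanged" : "changed"), "\n";
```

Copying into `$text` inside the closure, as in the corrected benchmark entry, means every iteration starts from the original accented string.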
Re^4: The Björk Situation by thundergnat (Deacon) on Feb 15, 2006 at 19:32 UTC
Good point. Though Text::Unidecode transliterates eth (ð) as d rather than the more generally accepted th. That's just quibbling though; you really shouldn't be using ANY of these functions lightly, since they destroy information and change the meaning of the text.
Re^5: The Björk Situation by rhesa (Vicar) on Feb 15, 2006 at 19:52 UTC
More quibbling ;) http://en.wikipedia.org/wiki/Eth_(letter) says "the letter had its origin as a d with a cross-stroke added". I don't think d is such a bad transliteration then. In my view, it's the thorn (þ) that should become th. And in fact, Text::Unidecode does so.

I do agree with you though that all these transliterations lose information. But that makes them well suited for internal representations, especially in text searches.

Another advantage of Text::Unidecode is that it handles a lot more than what's in the Latin-1 supplement. This quote from the perldoc describes it best: "In other words, Unidecode's approach is broad (knowing about dozens of writing systems), but shallow (not being meticulous about any of them)."

So for speed and generality, I'd recommend it. If you need precision, then transliteration may not be such a good idea altogether.
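To illustrate the "internal representation for text searches" idea, here is a rough sketch in core Perl. The mapping table `%ascii_for` and the helper `search_key` are made up for this example and cover only a handful of Latin-1 letters (with eth as d and thorn as th, as argued above); a real application would more likely call a module such as Text::Unidecode. The point is that the original text is kept for display, while only the lossy key is used for matching.

```perl
use strict;
use warnings;

# Hypothetical, deliberately tiny mapping; not a complete table.
my %ascii_for = (
    "\x{F0}" => 'd',  "\x{D0}" => 'D',     # eth
    "\x{FE}" => 'th', "\x{DE}" => 'Th',    # thorn
    "\x{F6}" => 'o',  "\x{F3}" => 'o',     # o-umlaut, o-acute
    "\x{E9}" => 'e',                       # e-acute
);

# Build a lossy, lower-cased ASCII key for searching.
# Characters without a mapping are left untouched.
sub search_key {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/exists $ascii_for{$1} ? $ascii_for{$1} : $1/ge;
    return lc $text;
}

# Keep the original spelling as the value, index it under the folded key.
my @names = ("Bj\x{F6}rk Gu\x{F0}mundsd\x{F3}ttir", "\x{DE}\x{F3}r");
my %index = map { search_key($_) => $_ } @names;

# A query typed on an ASCII keyboard is folded the same way before lookup.
my $query = "bjork gudmundsdottir";
print exists $index{ search_key($query) } ? "found\n" : "missing\n";
```

Because the key is derived and the original is stored alongside it, the information loss stays confined to the lookup step.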
Re^6: The Björk Situation by japhy (Canon) on Feb 15, 2006 at 22:03 UTC
Re-read that wikipedia entry, though: Ð and þ were replaced with th. Besides, "eth" represents the hard "th" sound (in "them") while "thorn" represents the soft "th" sound (in "thin").

Jeff `japhy` Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and `perl` hacker
How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
Re^7: The Björk Situation by rhesa (Vicar) on Feb 15, 2006 at 22:40 UTC
Re^5: The Björk Situation by helgi (Hermit) on Feb 22, 2006 at 11:28 UTC
As an Icelander I just wish to point out that we always transliterate 'ð' as 'd', not 'th'. So, as usual, the standard Perl module does the right thing.

-- Regards, Helgi Briem hbriem AT f-prot DOT com
Re^6: The Björk Situation by DrHyde (Prior) on Feb 23, 2006 at 09:51 UTC
Speaking as someone who has been known to write in Anglo-Saxon on occasion: we usually transliterate 'ð' as 'th'. 'þ' is always transliterated as 'th'.