in reply to Re^3: The Björk Situation
in thread The Björk Situation

Good point, though Text::Unidecode transliterates eth (ð) as d rather than the more generally accepted th. That's just quibbling, though. You really shouldn't be using ANY of these functions lightly, since they destroy information and change the meaning of the text.

Re^5: The Björk Situation
by rhesa (Vicar) on Feb 15, 2006 at 19:52 UTC
    More quibbling ;)

    http://en.wikipedia.org/wiki/Eth_(letter) says "the letter had its origin as a d with a cross-stroke added". I don't think d is such a bad transliteration then.

    In my view, it's the thorn (þ) that should become th. And in fact, Text::Unidecode does so.

    I do agree with you though that all these transliterations lose information. But that makes them well suited for internal representations, especially in text searches.

    Another advantage of Text::Unidecode is that it handles a lot more than what's in the Latin-1 supplement. This quote from the perldoc describes it best: "In other words, Unidecode's approach is broad (knowing about dozens of writing systems), but shallow (not being meticulous about any of them)."

    So for speed and generality, I'd recommend it. If you need precision, then transliteration may not be such a good idea at all.
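The text-search use case can be illustrated with a minimal sketch. It is written in Python with only the standard library and is not Text::Unidecode or its data: the function name `search_key` and the `SPECIAL` table are hypothetical, and the table simply hard-codes the eth-to-d and thorn-to-th choices discussed in this thread.

```python
import unicodedata

# Hypothetical override table for letters that have no Unicode decomposition.
# It encodes the choices debated in this thread (eth -> d, thorn -> th);
# it is NOT the table shipped with Text::Unidecode.
SPECIAL = {"ð": "d", "Ð": "D", "þ": "th", "Þ": "Th"}


def search_key(text, special=SPECIAL):
    """Fold text to a lossy ASCII-ish key meant only for matching, not display."""
    out = []
    for ch in text:
        if ch in special:
            out.append(special[ch])
            continue
        # NFKD splits an accented letter into base letter + combining marks;
        # dropping the combining marks leaves the base letter.
        decomposed = unicodedata.normalize("NFKD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out).lower()


if __name__ == "__main__":
    for name in ("Björk", "Guðmundsdóttir", "Þór"):
        print(name, "->", search_key(name))
```

Because the key is lossy, "Björk" and "Bjork" fold to the same string, which is what makes it useful for searching and unsuitable for display. Passing a different table, for example `{**SPECIAL, "ð": "th"}`, switches to the other convention argued for here.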

      Re-read that Wikipedia entry, though: ð and þ were both replaced with th. Besides, "eth" represents the hard "th" sound (in "them") while "thorn" represents the soft "th" sound (in "thin").

      Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
      How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
        That both eth and thorn were replaced with th in English by the Normans doesn't change the fact that the eth is the voiced version, while the thorn is the voiceless version. This distinction is still visible in the IPA symbol.
        I believe this distinction also shows in the hip spellings of "the" and "that" as "da" and "dat". I'd say that rappers would vote for Unidecode's decision ;)
Re^5: The Björk Situation
by helgi (Hermit) on Feb 22, 2006 at 11:28 UTC
    As an Icelander I just wish to point out that we always transliterate 'ð' as 'd', not 'th'.

    So, as usual, the standard Perl module does the right thing.


    --
    Regards,
    Helgi Briem
    hbriem AT f-prot DOT com
      As someone who has been known to write in Anglo-Saxon on occasion, I can say that we usually transliterate 'ð' as 'th'. 'þ' is always transliterated as 'th'.