in reply to Problem matching non-english chars

This is a far-from-ideal solution, but it's quick (though dirty) and will generally work. It does nothing for characters that fall outside the English character set altogether, but it works for characters that are basically accented English letters. That being said, here it is:

In short, you strip all accents from user-entered search terms and from the database entries used for searching (leave the "display" database entries alone). For example, map Ä to A, å to a, and so on.

So if someone tries to look up "Äbc" in your file, it would be converted to "Abc" before it tries to match on the file. Professionally I work with a lot of bands and find that conversion very handy. For example, most people will spell Moxy Früvous as "Moxy Fruvous" (without the umlaut over the u) when performing a search, so unless I strip the accents both from the stored search entries and from the incoming search terms, that search would never find it.

Now this is my very nasty-looking translation statement that replaces accented characters (ISO-8859-1) with their unaccented ASCII equivalents.

$s=~tr/\xC0\xC1\xC2\xC3\xC4\xC5\xC6\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD8\xD9\xDA\xDB\xDC\xDD\xDF\xE0\xE1\xE2\xE3\xE4\xE5\xE6\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF\xF1\xF2\xF3\xF4\xF5\xF6\xF8\xF9\xFA\xFB\xFC\xFD\xFF/\x41\x41\x41\x41\x41\x41\x41\x43\x45\x45\x45\x45\x49\x49\x49\x49\x44\x4E\x4F\x4F\x4F\x4F\x4F\x4F\x55\x55\x55\x55\x59\x73\x61\x61\x61\x61\x61\x61\x61\x63\x65\x65\x65\x65\x69\x69\x69\x69\x6E\x6F\x6F\x6F\x6F\x6F\x6F\x75\x75\x75\x75\x79\x79/;
As I said: dirty and imperfect, but quick, and it generally works right.
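
For anyone who wants to see the "clean both sides" idea end to end, here is a rough sketch rather than a finished implementation. The sub name strip_accents, the %by_key hash and the sample data are all made up for illustration. The mapping is meant to be the same as the tr statement above, just written as ranges, and it assumes the strings are ISO-8859-1.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption, not from the post above):
# returns an accent-free copy of an ISO-8859-1 string.
sub strip_accents {
    my ($s) = @_;
    # When the replacement list is shorter than the search list,
    # tr repeats the last replacement character.
    $s =~ tr/\xC0-\xC6/A/;        # accented A, AE ligature
    $s =~ tr/\xC7/C/;             # C cedilla
    $s =~ tr/\xC8-\xCB/E/;
    $s =~ tr/\xCC-\xCF/I/;
    $s =~ tr/\xD0\xD1/DN/;        # Eth, N tilde
    $s =~ tr/\xD2-\xD6\xD8/O/;    # skips 0xD7, the multiplication sign
    $s =~ tr/\xD9-\xDC/U/;
    $s =~ tr/\xDD\xDF/Ys/;        # Y acute, sharp s
    $s =~ tr/\xE0-\xE6/a/;
    $s =~ tr/\xE7/c/;
    $s =~ tr/\xE8-\xEB/e/;
    $s =~ tr/\xEC-\xEF/i/;
    $s =~ tr/\xF1/n/;
    $s =~ tr/\xF2-\xF6\xF8/o/;    # skips 0xF7, the division sign
    $s =~ tr/\xF9-\xFC/u/;
    $s =~ tr/\xFD\xFF/y/;
    return $s;
}

# Made-up sample data: the display name keeps its accents,
# only the lookup key is stripped.
my @display_names = ("Moxy Fr\xFCvous", "\xC4bc");
my %by_key;
$by_key{ strip_accents($_) } = $_ for @display_names;

# Clean the search term the same way before looking it up.
for my $term ("Moxy Fruvous", "Moxy Fr\xFCvous", "Abc") {
    my $key = strip_accents($term);
    print "$key: ", (exists $by_key{$key} ? "found" : "not found"), "\n";
}
```

Note that the display strings are never modified, so whatever you show to the user still has its accents. Only the keys used for matching are stripped.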


Les Howard
www.lesandchris.com
Author of Net::Syslog and Number::Spell

RE: Re: Problem matching non-english chars
by Guano (Initiate) on Apr 20, 2000 at 11:21 UTC
    There's nothing wrong with quick and dirty. If using locales doesn't work on this specific customer's system, I will probably try your solution instead. I had thought of doing something similar, but I wanted to find out whether there is a less "dirty" solution. Once again, thanks to all of you!

    Jocke