Guano has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone! This is my first posting but hopefully not my last. I have a small problem with matching Swedish characters in Perl. I'm developing a simple database script that receives a string from an HTML form and matches the string against every line in a file. If there's a match I print it. The code goes something like this:
    foreach (@lines) {
        s/\n$//;
        @items = split /$field_separator/;
        $item_str = $items[$field-1];
        if ( $item_str =~ m/^$search_str$/i) {
            # We have a match, do something!
        }
    }
The case-insensitive matching works fine, as long as the strings don't contain any special Swedish characters, like å, ä or ö. If $item_str contains the character ä and $search_str the character Ä, I don't get a match. It is very important that this works (most of our customers are Swedish) and I know there is an easy solution to my problem. I just can't remember it. I hope someone can help me.

Jocke

Replies are listed 'Best First'.
Re: Problem matching non-english chars
by btrott (Parson) on Apr 18, 2000 at 20:54 UTC
    To expand a bit on chromatic's answer: you most likely want to use locales. From reading a bit of perllocale, it looks like the "use locale" pragma changes the way certain functions and operators think about characters (and numbers, etc.). So, for example, in your case you'd want ä to be the lower-case version of Ä.

    Normally that's not the case, but if you use locales, you can force Perl to think of the characters that way. Regular expressions and the case-modification functions are among the things affected by locales, so you could use your case-insensitive regexp, or you could apply lc or uc to both strings and then do the comparison.

    For example, I tried this on my local system. It's going to be different on yours, most likely, but this may give you the general idea:

        use locale;
        use POSIX qw/locale_h/;
        setlocale(LC_CTYPE, "sv");
        my $search_str = "gläd";
        my $item_str = "GLÄD";
        if ($item_str =~ /^$search_str$/i) {
            print "Matched!";
        }
    So, for me, the locale I set was "sv" (Sweden); this may differ slightly for you, as apparently the names aren't very standardized. perllocale suggests the following command lines to find the locale list:
        locale -a
        nlsinfo
        ls /usr/lib/nls/loc
        ls /usr/lib/locale
        ls /usr/lib/nls
    Some of these probably won't work, but hopefully, some will.
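A minimal sketch of the lc-based comparison mentioned above, written so that it copes with locale names differing between systems. The candidate names and the helpers pick_locale and same_ci are assumptions made for illustration, not something from perllocale; whether Ä and ä actually fold together depends on which locale, if any, your system accepts.

```perl
use strict;
use warnings;
use locale;
use POSIX qw(locale_h);

# Candidate names are assumptions; check the output of `locale -a` on your system.
my @candidates = ('sv_SE.ISO8859-1', 'sv_SE.iso88591', 'sv_SE', 'sv');

# Hypothetical helper: try each name until setlocale() accepts one.
# setlocale() returns undef when the requested locale is not available.
sub pick_locale {
    for my $name (@_) {
        my $got = setlocale(LC_CTYPE, $name);
        return $got if defined $got;
    }
    return;    # none of the candidates is installed
}

# Hypothetical helper: case-insensitive equality via lc(),
# which follows LC_CTYPE inside the scope of "use locale".
sub same_ci {
    my ($x, $y) = @_;
    return lc($x) eq lc($y);
}

my $active = pick_locale(@candidates);
warn "No Swedish locale found; only ASCII letters will fold\n"
    unless defined $active;

# \xC4 is Ä and \xE4 is ä in ISO-8859-1.
print same_ci("GL\xC4D", "gl\xE4d") ? "Matched!\n" : "No match\n";
```

Comparing with eq after lc also sidesteps any regexp metacharacters in the search string, which the pattern match would otherwise interpret.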
      Also very helpful... I haven't had the time to look into locales yet, but your example is very close to what I had in mind. Even if the solution looks a little bit different on my system, you have all saved me a lot of headaches. I'm very grateful for that!
Re: Problem matching non-english chars
by chromatic (Archbishop) on Apr 18, 2000 at 18:55 UTC
      Yes! That is exactly what I was looking for. I just couldn't remember the term 'locale'. Now that I know it, I'm sure I'll find a suitable solution for my problem. Thanks.
Re: Problem matching non-english chars
by lhoward (Vicar) on Apr 18, 2000 at 22:22 UTC
    This is a far-from-ideal solution, but it's quick (though dirty) and will generally work right. It doesn't work at all for characters that are outside the English character set, but will work for characters that are basically accented English characters. That being said, here it is:

    In short, you strip all accents from user entered search terms and database entries (at least the database entries that are used for searching, leaving the "display" database entries alone). i.e. map Ä to A, å to a, etc....

    So if someone tries to look up "Äbc" in your file, it would be converted to "Abc" before it tries to match on the file. Professionally I work with a lot of bands and find that conversion very handy. For example, most people will spell Moxy Früvous as "Moxy Fruvous" (without the umlaut over the u) when performing a search, so unless I removed the accents from both the stored search entries and the incoming search terms, they would never find it.

    Now this is my very nasty-looking translation statement that will replace accented characters (ISO-8859-1) with their non-accented ASCII equivalents.

    $s=~tr/\xC0\xC1\xC2\xC3\xC4\xC5\xC6\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD8\xD9\xDA\xDB\xDC\xDD\xDF\xE0\xE1\xE2\xE3\xE4\xE5\xE6\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF\xF1\xF2\xF3\xF4\xF5\xF6\xF8\xF9\xFA\xFB\xFC\xFD\xFF/\x41\x41\x41\x41\x41\x41\x41\x43\x45\x45\x45\x45\x49\x49\x49\x49\x44\x4E\x4F\x4F\x4F\x4F\x4F\x4F\x55\x55\x55\x55\x59\x73\x61\x61\x61\x61\x61\x61\x61\x63\x65\x65\x65\x65\x69\x69\x69\x69\x6E\x6F\x6F\x6F\x6F\x6F\x6F\x75\x75\x75\x75\x79\x79/;
    As I said.. dirty and imperfect, but quick and generally works right....
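The same mapping can be sketched in a more readable form using ranges, wrapped in a function and applied to both the search term and the stored field, as described above. The names strip_accents and search_key are hypothetical, and the sketch assumes the strings hold ISO-8859-1 bytes, as in the statement above.

```perl
use strict;
use warnings;

# Map accented ISO-8859-1 letters to unaccented ASCII. When the replacement
# list is shorter than the search list, tr/// repeats its last character,
# so each range collapses to a single letter.
sub strip_accents {
    my ($s) = @_;
    $s =~ tr/\xC0-\xC6/A/;
    $s =~ tr/\xC7/C/;
    $s =~ tr/\xC8-\xCB/E/;
    $s =~ tr/\xCC-\xCF/I/;
    $s =~ tr/\xD0\xD1/DN/;
    $s =~ tr/\xD2-\xD6\xD8/O/;
    $s =~ tr/\xD9-\xDC/U/;
    $s =~ tr/\xDD\xDF/Ys/;
    $s =~ tr/\xE0-\xE6/a/;
    $s =~ tr/\xE7/c/;
    $s =~ tr/\xE8-\xEB/e/;
    $s =~ tr/\xEC-\xEF/i/;
    $s =~ tr/\xF1/n/;
    $s =~ tr/\xF2-\xF6\xF8/o/;
    $s =~ tr/\xF9-\xFC/u/;
    $s =~ tr/\xFD\xFF/y/;
    return $s;
}

# Hypothetical helper: the key used for comparison only, never for display.
# After stripping, only ASCII letters remain, so plain lc() is enough.
sub search_key {
    my ($s) = @_;
    return lc strip_accents($s);
}

# \xC4 is Ä, \xE4 is ä and \xFC is ü in ISO-8859-1.
print "match\n" if search_key("GL\xC4D") eq search_key("gl\xE4d");
print search_key("Moxy Fr\xFCvous"), "\n";
```

Because both sides go through search_key, the comparison needs no locale support at all; the price is that "a" and "ä" become indistinguishable in searches.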


    Les Howard
    www.lesandchris.com
    Author of Net::Syslog and Number::Spell

      There's nothing wrong with quick and dirty... If using locales doesn't work on this specific customer's system, I will probably try your solution instead. I had thought of doing something similar, but I wanted to find out if there was a less "dirty" solution. Once again, thanks to all of you!

      Jocke