Foreign language characters...

meetn2veg has asked for the wisdom of the Perl Monks concerning the following question:

Re: Foreign language characters... by helgi (Hermit) on Oct 08, 2002 at 15:04 UTC
Here's a shortened version of a method I have used in the past:

```perl
use strict;
use warnings;
use utf8;    # the source contains literal non-ASCII characters

my @names = qw/Árni Óli Þorgeir Ýr Ægir Þór/;
@names = fix_chars(@names);
print "$_\n" for @names;

sub fix_chars {
    my @ok;
    for (@_) {
        tr/ÁÐÉÍÓÚÝÖáðéíóúýö/ADEIOUYOadeiouyo/;
        s/Þ/Th/g;
        s/Æ/Ae/g;
        s/þ/th/g;
        s/æ/ae/g;
        s/\W/_/g;    # Replace any remaining non-word chars with underscores
        push @ok, $_;
    }
    return @ok;
}
```

This is intended for transliterating Icelandic names into plain international (ASCII) characters. It uses transliteration (`tr///`) for characters that map to a single letter and substitution (`s///`) for those that expand to two letters. You can of course decide for yourself which characters you will allow and disallow.

-- Regards, Helgi Briem helgi AT decode DOT is
Re2: Foreign language characters... by blakem (Monsignor) on Oct 09, 2002 at 04:27 UTC
The replacements are good, but the interface is a bit murky. It returns the "fixed" list, but it also modifies the original list, because `$_` inside the loop is an alias for the caller's elements. For instance, the output of your snippet is the same if you simply call: `fix_chars(@names);` instead of: `@names = fix_chars(@names);`

I would rewrite it so that it either left the original list intact, or didn't return the "fixed" list. Something like:

```perl
sub fix_chars {
    my @fixed = @_;    # work on a copy so the caller's list is untouched
    for (@fixed) {
        tr/ÁÐÉÍÓÚÝÖáðéíóúýö/ADEIOUYOadeiouyo/;
        ...
    }
    return @fixed;
}
```

-Blake
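For reference, a fuller sketch of the copy-first variant follows. It assumes the source file is saved as UTF-8 (hence `use utf8`) and fills the elided part with the substitutions from helgi's post. The `/g` flags and the explicit ASCII character class in the last substitution are additions for illustration, not part of either original.

```perl
use strict;
use warnings;
use utf8;    # assumption: this file is saved as UTF-8

# Works on a copy of the arguments, so the caller's list is left intact.
sub fix_chars {
    my @fixed = @_;
    for (@fixed) {
        tr/ÁÐÉÍÓÚÝÖáðéíóúýö/ADEIOUYOadeiouyo/;
        s/Þ/Th/g;
        s/Æ/Ae/g;
        s/þ/th/g;
        s/æ/ae/g;
        s/[^A-Za-z0-9_]/_/g;    # anything still outside this ASCII set becomes '_'
    }
    return @fixed;
}

my @names = qw/Árni Óli Þorgeir Ýr Ægir Þór/;
my @ascii = fix_chars(@names);
print join(' ', @ascii), "\n";    # prints: Arni Oli Thorgeir Yr Aegir Thor
```

After the call, `@names` still holds the original Icelandic spellings, which is the point of copying `@_` first.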
Re: Foreign language characters... by John M. Dlugosz (Monsignor) on Oct 08, 2002 at 14:54 UTC
Perl has the complete set of Unicode data tables in Perl data structures. Try the Unicode::CharName module. The accented letters all mention the base character in their names, so you can look up the name for a character code, find that it's "LATIN SMALL LETTER A WITH RING ABOVE", and then look up which character is just "LATIN SMALL LETTER A". Use the Memoize module to remember the results so that each lookup is only done the first time it is needed.

-- John
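A rough sketch of that name-based lookup is below. It swaps in the core `charnames` module (`charnames::viacode` and `charnames::vianame`) rather than Unicode::CharName, and the helper names `base_char` and `strip_accents` are made up for illustration. It only handles single Latin letters whose name has the form "... LETTER X WITH ...", so characters such as Þ, Ð and Æ pass through unchanged.

```perl
use strict;
use warnings;
use utf8;                 # assumption: this file is saved as UTF-8
use charnames ':full';
use Memoize;

memoize('base_char');     # each distinct character is looked up only once

# Hypothetical helper: map an accented Latin letter to its base letter.
sub base_char {
    my ($char) = @_;
    my $name = charnames::viacode(ord $char);
    return $char
        unless defined $name
        && $name =~ /^(LATIN (?:CAPITAL|SMALL) LETTER [A-Z]) WITH /;
    my $code = charnames::vianame($1);    # code point of the base letter
    return defined $code ? chr $code : $char;
}

# Hypothetical helper: apply base_char to every non-ASCII character.
sub strip_accents {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/base_char($1)/ge;
    return $text;
}

print strip_accents('Árni Óli'), "\n";    # prints: Arni Oli
```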
Re: Foreign language characters... by Abigail-II (Bishop) on Oct 08, 2002 at 15:08 UTC
On Unix, the only forbidden characters are the NUL byte and the slash (/). All others are legal, that is, if the code points are less than 256. However, many vendors have filesystems that (partially) support Unicode filenames, but you still can't use NUL and /. But why use such filenames in the first place?

Abigail
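As a minimal sketch of that rule, the check below rejects only names containing NUL or a slash. The function name is hypothetical, and the empty-name check is an extra assumption rather than something stated above.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $name contains neither NUL nor '/'.
sub is_legal_unix_filename {
    my ($name) = @_;
    return 0 unless defined $name && length $name;    # assumption: reject empty names
    return $name =~ m{[\0/]} ? 0 : 1;
}

print is_legal_unix_filename('notes.html') ? "ok\n" : "rejected\n";    # prints: ok
print is_legal_unix_filename('a/b')        ? "ok\n" : "rejected\n";    # prints: rejected
```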
Re: Foreign language characters... by fglock (Vicar) on Oct 08, 2002 at 14:38 UTC
Convert from your character set to plain ASCII. Use one of the character encoding modules (such as Encode.pm).
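A small sketch with Encode might look like the following. It assumes the input arrives as ISO-8859-1 bytes, and the helper name is hypothetical. Note that Encode on its own does not de-accent anything: characters with no ASCII equivalent are simply replaced, here with an underscore via a code-reference fallback.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: ISO-8859-1 bytes in, ASCII bytes out.
# Every character that ASCII cannot represent becomes '_'.
sub latin1_to_ascii {
    my ($bytes) = @_;
    my $text = decode('ISO-8859-1', $bytes);     # assumption: input is Latin-1
    return encode('ascii', $text, sub { '_' });  # fallback for unmappable characters
}

print latin1_to_ascii("\xC1rni \xD3li"), "\n";    # prints: _rni _li
```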
Re: Foreign language characters... by seattlejohn (Deacon) on Oct 08, 2002 at 14:57 UTC
There are lots and lots of characters that can cause problems like this, and you really have no hope of enumerating them all. You should think seriously about translating everything except a small subset of legal characters to underscores. That way you will catch accented characters, Unicode characters, undesirable characters and sequences such as `..`, `~`, and `/`, and so on. Something like this would probably do: `$name =~ tr/-A-Za-z0-9/_/c;`

I know this doesn't completely answer your question, because you asked how to translate accented characters to their unaccented versions, not how to replace them across the board. The big problem with the de-accenting approach is that there is a huge variety of accented characters you potentially have to deal with. What's more, I believe that the codes used for accented characters (which are not part of standard ASCII) will vary depending on which character set you are actually using.

You didn't say explicitly, but I'm assuming you're creating these Web pages via CGI. Security is something you'll need to approach very seriously if you are doing things like writing files based on user-supplied filenames. You might also want to read up on taint mode and check out Ovid's CGI Course for some further hints.
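To make the whitelist approach concrete, here is a minimal sketch that wraps the same `tr` in a function working on a copy. The function name is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only '-', A-Z, a-z and 0-9; everything else becomes '_'.
sub sanitize_name {
    my ($name) = @_;
    (my $safe = $name) =~ tr/-A-Za-z0-9/_/c;    # /c complements the search list
    return $safe;
}

print sanitize_name('../etc/passwd'), "\n";     # prints: ___etc_passwd
print sanitize_name('my page~1.html'), "\n";    # prints: my_page_1_html
```

Because the dot is not in the allowed set, any file extension is flattened too, so an extension such as `.html` would have to be appended after sanitizing.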
Re: Foreign language characters... by talexb (Chancellor) on Oct 08, 2002 at 13:35 UTC
Sure, just translate all accented characters to unaccented characters. I'd use `tr`. I don't know which codes to use, though; hopefully another Monk will be able to help you there.

--t. alex
but my friends call me T.