meetn2veg has asked for the wisdom of the Perl Monks concerning the following question:

New Monk - please help me to avoid dirty habits...

Got this thing I've written - user creates a webpage. The title of the page is used as the filename.html

Spaces are converted to underscores and a limit on the number of characters is in place and working.

Just came across a problem - accents on letters (á,é,í,ó,ú etc..)

What's "the best" (or most efficient) way of filtering accented characters and replacing them with 'normal' non-accented characters to form a valid filename?

I've thought about a simple s/.../g etc., but don't know the necessary ASCII codes for accented characters.

Can anyone shed some light for me please.

Regards
Richard.

Replies are listed 'Best First'.
Re: Foreign language characters...
by helgi (Hermit) on Oct 08, 2002 at 15:04 UTC
    Here's a shortened version of a method I have used in the past:

        use utf8;    # needed if this file is saved as UTF-8

        my @names = qw/Árni Óli Þorgeir Ýr Ægir Þór/;
        my @names = fix_chars(@names);
        print "$_\n" for @names;

        sub fix_chars {
            my @ok;
            for (@_) {
                # One-to-one replacements
                tr/ÁÐÉÍÓÚÝÖáðéíóúýö/ADEIOUYOadeiouyo/;
                # Letters that expand to two characters
                s/Þ/Th/g;
                s/Æ/Ae/g;
                s/þ/th/g;
                s/æ/ae/g;
                s/\W/_/g;    # Replace any remaining non-word chars with _
                push @ok, $_;
            }
            return @ok;
        }

    This is intended for transliterating Icelandic names into plain ASCII.

    It uses tr/// to replace characters one for one and s/// for the letters that expand to two characters (Þ, Æ, þ, æ).

    You can of course decide for yourself what characters you will allow and disallow.

    --
    Regards,
    Helgi Briem
    helgi AT decode DOT is

      The replacements are good, but the interface is a bit murky. It returns the "fixed" list, but it also modifies the original list. For instance, the output of your snippet is the same if you simply:
      fix_chars(@names);
      instead of:
      my @names = fix_chars(@names);
      I would rewrite it so that it either left the original list intact, or didn't return the "fixed" list. Something like:
          sub fix_chars {
              my @fixed = @_;
              for (@fixed) {
                  tr/ÁÐÉÍÓÚÝÖáðéíóúýö/ADEIOUYOadeiouyo/;
                  ...
              }
              return @fixed;
          }

      -Blake
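A minimal sketch of the copy-first interface described above, written out in full. The name fix_chars_copy is hypothetical, and the code assumes the source file is saved as UTF-8 (hence use utf8). The substitutions are the ones from helgi's snippet; only the handling of @_ changes.

```perl
use strict;
use warnings;
use utf8;    # assumption: this source file is saved as UTF-8

# Hypothetical name. Returns transliterated copies and leaves the
# caller's list untouched.
sub fix_chars_copy {
    my @fixed = @_;    # copy first, so the aliases in @_ are never modified
    for (@fixed) {
        tr/ÁÐÉÍÓÚÝÖáðéíóúýö/ADEIOUYOadeiouyo/;    # one-to-one replacements
        s/Þ/Th/g;                                  # two-letter expansions
        s/Æ/Ae/g;
        s/þ/th/g;
        s/æ/ae/g;
        s/\W/_/g;                                  # anything else becomes _
    }
    return @fixed;
}

my @names = qw/Árni Þór/;
my @ascii = fix_chars_copy(@names);
print "$_\n" for @ascii;    # @names still holds the accented originals
```

Because the loop runs over the copy, calling the function in void context does nothing visible, which makes the interface harder to misuse.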

Re: Foreign language characters...
by John M. Dlugosz (Monsignor) on Oct 08, 2002 at 14:54 UTC
    Perl has the complete set of Unicode data tables in Perl data structures. Try the Unicode::CharName module. The accented letters all mention the base char in the name, so you can look up the name for a character code and find that it's "LATIN SMALL LETTER A WITH RING ABOVE", then grep for the char whose name is just "LATIN SMALL LETTER A".

    Use the Memoize module to remember the results so each lookup is only done the first time needed.

    —John
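The name-lookup idea above can be sketched roughly as follows. This version uses the core charnames module rather than Unicode::CharName, which is a substitution on my part. The helper names base_char and deaccent are hypothetical. It only handles Latin letters whose name has the form "LATIN ... LETTER X WITH ..." and passes everything else through unchanged.

```perl
use strict;
use warnings;
use charnames ':full';    # core: charnames::viacode and charnames::vianame
use Memoize;              # core: caches the result for each argument

# Hypothetical helper: map one character to its unaccented base letter
# by dropping the " WITH ..." part of its Unicode name.
sub base_char {
    my ($char) = @_;
    my $name = charnames::viacode(ord $char);
    return $char
        unless defined $name
        && $name =~ /^(LATIN (?:SMALL|CAPITAL) LETTER [A-Z]) WITH /;
    my $code = charnames::vianame($1);    # code point of the base letter
    return defined $code ? chr $code : $char;
}
memoize('base_char');    # each distinct character is looked up only once

# Hypothetical helper: apply base_char to every character of a string.
sub deaccent {
    my ($text) = @_;
    return join '', map { base_char($_) } split //, $text;
}

print deaccent("\x{C1}rni"), "\n";
```

Letters such as Þ and Æ have no "WITH" in their names, so they come back unchanged and still need explicit rules like the s/// lines in helgi's reply.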

Re: Foreign language characters...
by Abigail-II (Bishop) on Oct 08, 2002 at 15:08 UTC
    On Unix, the only forbidden characters are the NUL byte and the /. All others are legal, that is, if the code points are less than 256. However, many vendors have filesystems that (partially) support Unicode filenames - but you still can't use NUL and /.

    But why use such filenames in the first place?

    Abigail

Re: Foreign language characters...
by fglock (Vicar) on Oct 08, 2002 at 14:38 UTC

    Convert from your character-set to plain ASCII.

    Use one of the character encoding modules (such as Encode.pm).
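One possible reading of this suggestion, as a hedged sketch: Encode turns the incoming bytes into Perl characters, but on its own it does not strip accents, so this example adds Unicode::Normalize (my addition, not named in the reply) to split each letter from its accent. The helper name title_to_ascii and the ISO-8859-1 default are assumptions.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFD);

# Hypothetical helper. Assumes the title arrives as ISO-8859-1 bytes
# unless another charset name is passed in.
sub title_to_ascii {
    my ($bytes, $charset) = @_;
    $charset = 'iso-8859-1' unless defined $charset;
    my $text = decode($charset, $bytes);    # bytes -> Perl characters
    $text = NFD($text);                     # split letters from their accents
    $text =~ s/\p{Mn}//g;                   # drop the combining marks
    $text =~ tr/\x00-\x7F/_/c;              # anything still non-ASCII -> _
    return $text;
}

print title_to_ascii("caf\xE9"), "\n";
```

Characters with no decomposition, such as Þ or Æ, fall through to the final tr/// and become underscores.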

Re: Foreign language characters...
by seattlejohn (Deacon) on Oct 08, 2002 at 14:57 UTC
    There are lots and lots of characters that can cause problems like this, and you really have no hope of trying to enumerate them all. You should think seriously about translating everything except a small subset of legal characters to underscores. That way you will catch accented characters, Unicode characters, undesirable sequences of characters such as .. and ~ and /, and so on. Something like this would probably do:

        $name =~ tr/-A-Za-z0-9/_/c;

    I know this doesn't completely answer your question, because you asked how to translate accented characters to their unaccented versions, not how to replace them across the board. The big problem with the de-accenting approach is that there is a huge variety of accented characters you potentially have to deal with. What's more, I believe that the codes used for accented characters (which are not part of standard ASCII) will vary depending on what character set you are actually using.

    You didn't say explicitly, but I'm assuming you're creating these Web pages via CGI. Security is something you'll need to approach very seriously if you are doing things like writing files based on user-supplied filenames. You might also want to read up on taint mode and check out Ovid's CGI Course for some further hints.
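For illustration, here is the whitelist translation from this reply wrapped in a small function. The name safe_stem and the 32-character default limit are assumptions, chosen only because the original question mentions a length limit. Treat it as a sketch, not a complete security measure.

```perl
use strict;
use warnings;

# Hypothetical helper; the 32-character default limit is an assumption.
sub safe_stem {
    my ($title, $max) = @_;
    $max = 32 unless defined $max;
    # Replace every character outside the whitelist with an underscore.
    (my $stem = $title) =~ tr/-A-Za-z0-9/_/c;
    $stem = substr $stem, 0, $max;
    return length($stem) ? $stem : '_';    # never return an empty name
}

print safe_stem('../etc/passwd'), ".html\n";
```

Because dots and slashes are outside the whitelist, a title cannot smuggle in a path or an extension. The caller appends .html itself.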

Re: Foreign language characters...
by talexb (Chancellor) on Oct 08, 2002 at 13:35 UTC

    Sure, just translate all accented characters to unaccented characters. I'd use tr. I don't know about what codes to use though -- hopefully another Monk will be able to help you there.

    --t. alex
    but my friends call me T.