daviddhall has asked for the wisdom of the Perl Monks concerning the following question:

Ok, here's hopefully an interesting one for you guys:
Is there some nifty Perl package that converts "upper ASCII" characters such as È, Í, Ñ, or á to "normal" lower ASCII characters like E, I, N, or a? I'm trying to write a directory generation program based on strings, and those characters are throwing me for a loop.
I guess I could write a hack program that converts them manually, but I was just curious. It seems like Perl always has a somewhat elegant solution.
begin ass kissing=>Thanks for the help! You guys are the greatest!<=end ass kissing

Replies are listed 'Best First'.
Re: Upper Ascii characters
by alfie (Pilgrim) on Apr 03, 2001 at 10:49 UTC
    Just to make it clear: those characters aren't ASCII. ASCII covers only the characters whose 8th bit is 0, i.e. codes 0 through 127.

    That said, a conversion routine will be quite tricky, because it depends entirely on the locale you are using. The same character has different codes in different charsets, so you will have to take the locale into account. Then you can at least convert them to lowercase:

        use locale;
        $foo = "ÖÄÜ";
        print lc($foo) . "\n";

    This only puts them in lowercase; it doesn't convert them to ASCII characters. That is something I wouldn't do: the characters are different on purpose. They mean different things, so I might question your reason.
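
    To make the "8th bit" point concrete, here is a minimal sketch of a check for characters outside the ASCII range. The helper name has_non_ascii is made up for illustration; it is not from any module.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string holds any character
# with a code above 127, i.e. anything that is not ASCII.
sub has_non_ascii {
    my ($text) = @_;
    return $text =~ /[^\x00-\x7F]/ ? 1 : 0;
}

print has_non_ascii("cafe")    ? "non-ASCII\n" : "plain ASCII\n";
print has_non_ascii("caf\xE9") ? "non-ASCII\n" : "plain ASCII\n";
```

    The check says nothing about which charset the high characters belong to; it only tells you that the string needs attention.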
    --
    Alfie
Re: Upper Ascii characters
by snowcrash (Friar) on Apr 03, 2001 at 11:12 UTC
    Alfie is probably right. Anyway, there is a module out there called Text::Unaccent
    that may fit your needs. It seems to require libiconv to work.
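
    If installing a module that depends on libiconv is a problem, a different technique can be sketched with the core Unicode::Normalize module: decompose each character, then drop the combining marks. This is not how Text::Unaccent works internally as far as I know, and the helper name strip_accents is made up for illustration. It assumes the input is already a decoded character string.

```perl
use strict;
use warnings;
use utf8;                          # the accented literal below is UTF-8 source text
use Unicode::Normalize qw(NFD);

# Hypothetical helper: split each accented letter into base letter
# plus combining mark, then remove the marks.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}//g;    # Mn = nonspacing marks (acute, grave, tilde, ...)
    return $decomposed;
}

print strip_accents("ÈÍÑá"), "\n";   # prints "EINa"
```

    Letters that have no decomposition, such as ø, æ or ß, pass through unchanged, so they would still need a hand-written mapping.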

    cheers
    snowcrash //////
Re: Upper Ascii characters
by cLive ;-) (Prior) on Apr 03, 2001 at 10:39 UTC
    Ooh, the words "can of worms" spring to mind here.

    Before anyone rushes off and writes a one-liner (possibly with a regexp, better ask Meow if that's OK :), there's one major question:

    "Are you working with text from only one character set?"

    Please feel free to correct me if I get this wrong, but AFAIK extended characters (i.e. those outside the 0-127 range) vary depending on the character set used. And with the growth of Unicode, this gets even more confusing...
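
    A small sketch of why the character set matters, using the core Encode module: the same single byte decodes to two unrelated characters depending on which charset you assume. The two charsets here are picked only as examples.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $byte = "\xE9";   # one byte, value 233

# decode() maps the byte to a Unicode code point according to the named charset.
my $latin1   = decode('iso-8859-1', $byte);   # U+00E9, e with acute accent
my $cyrillic = decode('iso-8859-5', $byte);   # U+0449, Cyrillic letter shcha

printf "ISO-8859-1: U+%04X\n", ord $latin1;
printf "ISO-8859-5: U+%04X\n", ord $cyrillic;
```

    So a rule like "byte 233 becomes e" is only right if the text really is Latin-1.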

    If you have sample text you can copy/paste from, you could just run it through a few regular expressions to clean up. As for modules, no idea, sorry :(

        my $string = "le café, à la carte, table d'hôte";
        $string =~ s/[èéêë]/e/g;
        $string =~ s/[àáâãäå]/a/g;
        # etc etc

    Apologies to those whose browsers are using a different character encoding and aren't seeing eeee and aaaaaa with accents in the regexps :)
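
    The "etc etc" could be filled in roughly like this, using tr/// instead of a row of substitutions. It is only a sketch: it assumes the script is saved as UTF-8 (hence use utf8), it covers just the common Latin-1 vowels plus ñ, and the helper name to_plain is made up for illustration.

```perl
use strict;
use warnings;
use utf8;   # the accented literals below are UTF-8 in this source file

# Hypothetical helper: map a handful of accented Latin-1 letters to ASCII.
# When the replacement list is shorter than the search list,
# tr/// reuses its last character for the rest.
sub to_plain {
    my ($text) = @_;
    $text =~ tr/àáâãäå/a/;
    $text =~ tr/èéêë/e/;
    $text =~ tr/ìíîï/i/;
    $text =~ tr/òóôõö/o/;
    $text =~ tr/ùúûü/u/;
    $text =~ tr/ñ/n/;
    $text =~ tr/ÀÁÂÃÄÅ/A/;
    $text =~ tr/ÈÉÊË/E/;
    $text =~ tr/ÌÍÎÏ/I/;
    $text =~ tr/ÒÓÔÕÖ/O/;
    $text =~ tr/ÙÚÛÜ/U/;
    $text =~ tr/Ñ/N/;
    return $text;
}

print to_plain("le café, à la carte, table d'hôte"), "\n";
# prints "le cafe, a la carte, table d'hote"
```

    Anything not listed (ç, ø, ß and friends) is left alone, so the table would need extending for real data.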