deibyz has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,

I'm using Text::Query::Advanced to let users search a number of documents. Most of these documents are written in Spanish (yes, I'm Spanish, which explains my bad English ;)), and they contain "funny" characters such as áéíóú. A search for "camion" should match the word "camión" as well as "camion", so I'm trying to find a simple way to get rid of those characters.

A simple substitution may work:

s/á/a/g; s/é/e/g; ... s/Ú/u/g;

But that would be too slow, since it makes many passes over the string (which may be long), and I would have to account for more characters in the future (â, ä, à, ...).

I've tried the tr/áéíóú/aeiou/ solution, but since "á" is a two-byte character, it doesn't work.

I've read perluniintro and perlunicode, but I've not found anything that can help me.

Any ideas are welcome.

Thanks in advance,

deibyz

Edited: Title changed.

Re: Playing with "funny" chars
by Eyck (Priest) on Sep 27, 2004 at 12:19 UTC

    I would just use /[aąAĄ]/ instead of just /a/ in patterns that you're using.

    Besides that, have you tried use locale?

    And you should remember to set LC_CTYPE or LC_ALL beforehand...

    Also, the tr question could be solved with a regular-expression substitution: match a class of characters and, for the replacement, call a routine that returns the correct character. This solves the multiple-passes problem.
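
The single-pass idea in the last paragraph could be sketched roughly as follows. This is only an illustration: the %plain table and the strip_accents name are made up for the example, and it assumes the script is saved as UTF-8 (hence use utf8) and that the text being searched has already been decoded into characters.

```perl
use strict;
use warnings;
use utf8;    # assumption: this source file is saved as UTF-8

# Hypothetical lookup table; extend it as new characters turn up.
my %plain = (
    'á' => 'a', 'é' => 'e', 'í' => 'i', 'ó' => 'o', 'ú' => 'u',
    'Á' => 'A', 'É' => 'E', 'Í' => 'I', 'Ó' => 'O', 'Ú' => 'U',
);

# Build one character class from the table keys.
my $class = join '', keys %plain;

# One pass over the string: every match is replaced via the table.
sub strip_accents {
    my ($text) = @_;
    $text =~ s/([$class])/$plain{$1}/g;
    return $text;
}

my $word = strip_accents('camión');   # "camion"
```

Adding a character later means adding one pair to the table; the substitution itself still makes a single pass.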

Re: Playing with "funny" chars
by cog (Parson) on Sep 27, 2004 at 15:09 UTC
    I use this and it works:

    y[áàãâäÁÀÃÂÄéèêëÉÈÊËíìîïÍÌÎÏóòõôöÓÒÔÕÖúùûüÚÙÛÜçÇ] [aaaaaAAAAAeeeeEEEEiiiiIIIIoooooOOOOOuuuuUUUUcC]
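
A different technique from the hand-maintained list above, offered only as a sketch: decompose the text with Unicode::Normalize (a core module in perl 5.8) and delete the combining marks, so that accented letters not listed in advance are handled too. The fold_accents name is hypothetical, and the code assumes the input is already a character string rather than raw bytes.

```perl
use strict;
use warnings;
use utf8;                          # assumption: source saved as UTF-8
use Unicode::Normalize qw(NFD);

# Hypothetical helper: decompose, then drop the combining marks.
sub fold_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);   # "á" becomes "a" followed by U+0301
    $decomposed =~ s/\p{Mn}//g;    # remove nonspacing combining marks
    return $decomposed;
}

my $folded = fold_accents('camión Çà');   # "camion Ca"
```

Note that this also turns "ñ" into "n", which may or may not be wanted for a Spanish search.
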
      Could you please tell me some details about configuration, platform, etc...

      #!/usr/local/bin/perl
      use strict;
      use warnings;
      $a = 'áéíóú';
      $a =~ tr{áéíóú} {aeiou}s;
      print $a;
      __OUTPUT__
      aeaoauauau

      I'm using perl5.8.5 on RHAS (perl 5.8.0 comes with the distro, but had problems with unicode).

      Thanks

        I'm using perl v5.8.4 on Red Hat Linux 9.

        Here's my complete script:

        #!/usr/bin/perl -pw
        use strict;
        y[áàãâäÁÀÃÂÄéèêëÉÈÊËíìîïÍÌÎÏóòõôöÓÒÔÕÖúùûüÚÙÛÜçÇ] [aaaaaAAAAAeeeeEEEEiiiiIIIIoooooOOOOOuuuuUUUUcC]

        Your script produces correct output on my machine ("aeiou")...

Re: Playing with "funny" chars
by mischief (Hermit) on Sep 27, 2004 at 12:57 UTC
      (oops, I wanted to reply to the first post but clicked here by accident ;) ).

      My recommendation is to use perl 5.8.0 or more recent and look at perldoc Encode, perldoc open, and perldoc -f open. If tr doesn't work because you have the characters encoded in two bytes, you can do

      $s = decode_utf8($s);

      That will convert the string into the internal representation where characters are characters and you don't have to worry about how many bytes they need for encoding.
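
Combined with tr, the decoding step might look like the sketch below. It assumes the input arrives as raw UTF-8 bytes and that the script itself is saved as UTF-8 with use utf8, so the tr lists are read as characters; the variable names are only for illustration.

```perl
use strict;
use warnings;
use utf8;                          # the tr lists below are characters
use Encode qw(decode_utf8);

# Raw UTF-8 bytes, as they might arrive from a file or a form
# (written as escapes so the example does not depend on the editor).
my $raw = "cami\xC3\xB3n";

my $text = decode_utf8($raw);      # now 6 characters instead of 7 bytes
(my $plain = $text) =~ tr/áéíóú/aeiou/;

print "$plain\n";                  # prints "camion"
```

Text read from a file can be decoded on the way in with an I/O layer instead, as described in perldoc open.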

        I think the problem is not in the string (I'm using perl5.8.5, because 5.8.0 had some bugs on Red Hat), but in the tr operator itself.

        The first attempt behaves like this:

        perl -e '$_="áéíóú";tr/áéíóú/aeiou/;print'
        aeaoauauau

        It seems that "á" is treated as two characters, maybe "´" and "a", and each one gets a different replacement character ("a" and "e").

        BTW, the encode and decode functions return values that make me think the string is well formed, and that it is tr// that is doing the wrong thing. Am I completely lost?
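
The output reported above is consistent with tr working on bytes rather than characters. Without use utf8, the five accented letters in the search list are ten bytes, every other one being 0xC3, while the shorter replacement list is padded with its last character. The sketch below reproduces that with the bytes spelled out; the variable names are only for illustration.

```perl
use strict;
use warnings;

# "áéíóú" as UTF-8 bytes: each letter is 0xC3 followed by one more byte.
my $bytes = "\xC3\xA1\xC3\xA9\xC3\xAD\xC3\xB3\xC3\xBA";

# Byte-wise search list, as perl sees the literal without "use utf8".
# The replacement list is shorter, so its last character ("u") is repeated,
# and the repeated 0xC3 keeps only its first mapping ("a").
(my $result = $bytes) =~ tr/\xC3\xA1\xC3\xA9\xC3\xAD\xC3\xB3\xC3\xBA/aeiou/;

print length($bytes), " bytes\n";   # prints "10 bytes"
print "$result\n";                  # prints "aeaoauauau"
```

So the string can be perfectly well formed and tr still gives this result, because the search list in the source code is being read as bytes.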

Re: Playing with extended chars
by chanio (Priest) on Sep 27, 2004 at 23:14 UTC
    Maybe it is something outside perl. Have you configured the locale variables on your system? Try set and look at your locale settings (LC_ALL, etc.).

    It happened to me (also Spanish) and it was a mess to work out whether the problem was in perl or in my local environment variables. Please check all your default variables before trying perl's.

    use POSIX qw(strftime setlocale LC_ALL LC_CTYPE);
    my ($loc) = POSIX::setlocale( &POSIX::LC_ALL, 'es_ES.ISO8859-1' );
    my ($now_string) = strftime "%a %b %e %H:%M:%S %Y", localtime;
    my ($fecBita) = strftime "%Y-%m-%d %H:%M:%S", localtime;

    Your code is OK!

    .{\('v')/}
    _`(___)' __________________________
    Wherever I lay my KNOPPIX disk, a new FREE LINUX nation could be established.