Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I use the following to match Swedish words:
$sentence =~ /([A-Z\N{LATIN CAPITAL LETTER A WITH RING ABOVE}\N{LATIN CAPITAL LETTER A WITH DIAERESIS}\N{LATIN CAPITAL LETTER O WITH DIAERESIS}\N{LATIN CAPITAL LETTER E WITH ACUTE}]+)/ig

Is this efficient?
Also, how can I do this for Swedish words?
my $word = ucfirst(lc($word));
Thanks!!

Re: Unicode operations
by ikegami (Patriarch) on Jan 03, 2010 at 21:22 UTC

    Is this efficient?

    When compared with what?

    Also, how can I do [ucfirst(lc($word))] for Swedish words?

    It should work as-is for Swedish words.

    use open ':std', ':locale';
    use charnames ':full';

    my $word = "\N{LATIN CAPITAL LETTER A WITH RING ABOVE}"
             . "\N{LATIN CAPITAL LETTER A WITH DIAERESIS}"
             . "\N{LATIN CAPITAL LETTER O WITH DIAERESIS}"
             . "\N{LATIN CAPITAL LETTER E WITH ACUTE}";

    print($word, "\n");
    print(ucfirst(lc($word)), "\n");

    ÅÄÖÉ
    Åäöé
    

    Of course, if the words are coming to you encoded (i.e. from a file handle), you need to decode them first.
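
    As a rough illustration of the decoding step, here is a minimal sketch. It assumes the input is UTF-8 (the thread does not say which encoding is in use), and the byte string stands in for a line read from an undecoded file handle.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the input arrives as UTF-8 bytes, as it would from a
# file handle opened without an encoding layer.
my $bytes = "\xC3\x85\xC3\x84\xC3\x96\xC3\x89";    # "ÅÄÖÉ" encoded as UTF-8

# Turn the bytes into a string of characters before any case mapping.
my $word = decode('UTF-8', $bytes);

my $fixed = ucfirst(lc($word));

# Encode again on the way out so the terminal or file receives UTF-8 bytes.
print encode('UTF-8', $fixed), "\n";
```

    Opening the file handle with an encoding layer such as '<:encoding(UTF-8)' would do the decoding on read instead; which approach fits depends on where the words come from.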

    You probably won't run into this problem, but if the characters in the range U+0080..U+00FF are left unchanged, precede the expression with

    utf8::upgrade( $word );

    That bug will be fixed in 5.12 (although it might require use 5.012;).
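
    To make the workaround concrete, here is a small sketch of where the utf8::upgrade call would go. The sample word and the UTF-8 output layer are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Assumption: output should be UTF-8.
binmode(STDOUT, ':encoding(UTF-8)');

# Every character here is below U+0100, so Perl may store the string
# in its 8-bit internal form.
my $word = "\xC5\xC4\xD6\xC9";    # Å Ä Ö É (U+00C5, U+00C4, U+00D6, U+00C9)

# Switch the internal representation to UTF-8. The characters themselves
# do not change, but lc and ucfirst now apply Unicode case rules.
utf8::upgrade($word);

my $fixed = ucfirst(lc($word));

print $fixed, "\n";
```

    The upgrade changes only how the string is stored, so comparisons with eq and the result of length are unaffected.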

Re: Unicode operations
by Khen1950fx (Canon) on Jan 03, 2010 at 20:46 UTC