Why [[:alpha:]] doesn't involve diacritic characters in replace expression?

wk has asked for the wisdom of the Perl Monks concerning the following question:

Terr!

I have a little script that should strip every non-alphabetic character from text-file input and then pass the remaining alphabetic characters, line by line, to another function. I tried [[:alpha:]] and \w/\W, but both handle diacritics (i.e. umlaut characters) in one context and not in another. So I made a simple example script to show my point:

#!/usr/bin/perl

use strict;
use utf8;
binmode STDIN, ":utf8";
binmode STDOUT, ":utf8";

# sample string: want to get rid of spaces and the exclamation mark;
# all other characters are alphas, and I really need them
my $str = "See on üks täppidega lause!";

# first printout ( how [[:alpha:]] works)
foreach my $mrk ( split(//, $str) ) {
        if ($mrk =~ /^[[:alpha:]]$/ ) {
                print "$mrk";
        }
}
print "\n\n";
# end of first printout


# second printout  ( how [[:alpha:]] doesn't work)
$str =~ s/[[:^alpha:]]//ig;
print "$str\n";
# end of second printout

exit(0);

__END__

First output:
Seeonükstäppidegalause

The second should be the same, but is:
Seeonkstppidegalause

Why doesn't the substitution know that diacritics are also alphas? Or is there something wrong in my code? I can see a workaround (writing my own replace, for example), but I'd like to get it working the standard way.

Perl is v5.8.8, in Kubuntu 8.04.

TIA,

Kõike hääd, WK

Re: Why [[:alpha:]] doesn't involve diacritic characters in replace expression? by JavaFan (Canon) on Oct 22, 2008 at 15:02 UTC
It doesn't explain why you are getting what you get, but the use of \w, \W, [:alpha:], [:word:], etc. is problematic in Perl if the string you're matching against contains characters in the range 128-255 and no characters above. Then \w, [:alpha:] and such *may* match the accented characters, depending on whether the string matched against has the UTF-8 flag on or not. That is, it will match if the UTF-8 flag is on. If it's off, then in some cases, if the pattern has the UTF-8 flag on, \w will match accented characters, but not always. And, finally, if there's no UTF-8 flag, it will be your locale that decides it, if you have an active locale.

Much safer is to use the appropriate Unicode properties. \p{IsAlpha} will match all characters marked 'alpha' in the Unicode database, regardless of UTF-8 flags or locales.

my $_ = "See on üks täppidega lause!";
for (split //) {print if /\p{IsAlpha}/}
say "";
$_ =~ s/\P{IsAlpha}//g;
say;
__END__
Seeonükstäppidegalause
Seeonükstäppidegalause
Re^2: Why [[:alpha:]] doesn't involve diacritic characters in replace expression? by moritz (Cardinal) on Oct 22, 2008 at 15:09 UTC
"It doesn't explain why you are getting what you get, but the use of \w, \W, `[:alpha:]`, `[:word:]`, etc is problematic in Perl, if the string you're matching against contains characters in the range 128-255, and no characters above"

To guard against that, you can use Unicode::Semantics, or utf8::upgrade on the string prior to matching.

That said, I'd stay with the Unicode properties in regexes, and only revert to locales if you really need language-dependent behavior. For a task like identifying printable and non-printable characters, Unicode is most likely the better choice.
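A minimal sketch of the utf8::upgrade approach, assuming a single character in the 128-255 range that starts out stored as one byte. The variable names are illustrative only. What the first match returns depends on Perl version, pragmas and locale, so the sketch prints it without promising a value; after the upgrade the string carries the UTF-8 flag and the POSIX class is matched with Unicode rules.

```perl
use strict;
use warnings;

# U+00FC (u with diaeresis) written as a byte escape: one character,
# stored as one byte, UTF-8 flag off. No "use utf8" here on purpose.
my $char = "\xFC";

my $flag_before  = utf8::is_utf8($char) ? 1 : 0;
my $match_before = $char =~ /^[[:alpha:]]$/ ? 1 : 0;   # version/locale dependent

# Switch the internal representation to UTF-8. The string still holds
# the same single character; only the storage and the flag change.
utf8::upgrade($char);

my $flag_after  = utf8::is_utf8($char) ? 1 : 0;
my $match_after = $char =~ /^[[:alpha:]]$/ ? 1 : 0;

print "before upgrade: flag=$flag_before match=$match_before\n";
print "after upgrade:  flag=$flag_after match=$match_after\n";
```

Upgrading does not change what the string contains (its length is still 1), which is why it is safe to apply just before matching.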
Re^2: Why [[:alpha:]] doesn't involve diacritic characters in replace expression? by almut (Canon) on Oct 22, 2008 at 18:59 UTC
"...regardless of UTF-8 flags"

This note could be slightly misleading. Unicode properties matching only works with either Perl character/Unicode strings (internally represented as UTF-8), in which case Perl's UTF-8 flag must be on, or with Latin1 as a special case, when the UTF-8 flag is off. In other words, the setting of the UTF-8 flag definitely is important. If it's off, the string is simply treated as Latin1, which might not always be appropriate.

Consider the Unicode codepoint U+5F4C, for example (CJK Unified Ideograph). It's in the Unicode properties category "OtherLetter", and thus also in "Alphabetic", which Perl does identify correctly when it's represented as a Perl character string with the UTF-8 flag on. When the same character is represented in some other (non-Latin1) encoding, e.g. Shift-JIS, Perl has no way of telling that it's in fact a letter. It just treats the respective bytes (i.e. the Shift-JIS two-byte representation `0x9c 0x5c`) as Latin1, looks up those codepoints in the Unicode database (which gives "STRING TERMINATOR" (category "Control") for U+009C, and "BACKSLASH" (category "OtherPunctuation") for U+005C), and reaches the wrong conclusion, as of course none of those qualify as "Alphabetic"...

use Encode;
use Devel::Peek;

print "U+5F4C as a regular Perl character string (UTF-8 flag on):\n";
my $str = "\x{5F4C}";
Dump $str;
printf "=> is%s alphabetic\n\n", $str =~ /\p{IsAlpha}/ ? "":"n't";

print "same character in Shift-JIS legacy encoding (UTF-8 flag off):\n";
$str = encode("shiftjis", $str);
Dump $str;
printf "=> is%s alphabetic\n\n", $str =~ /\p{IsAlpha}/ ? "":"n't";

Output:

U+5F4C as a regular Perl character string (UTF-8 flag on):
SV = PV(0x629098) at 0x645530
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x682d40 "\345\275\214"\0 [UTF8 "\x{5f4c}"]
  CUR = 3
  LEN = 8
=> is alphabetic

same character in Shift-JIS legacy encoding (UTF-8 flag off):
SV = PV(0x629098) at 0x645530
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x75b5d0 "\234\\"\0
  CUR = 2
  LEN = 8
=> isn't alphabetic
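The usual way around this is to decode legacy-encoded bytes into a Perl character string before matching. A minimal sketch, assuming the encoding of the input is known; the helper name has_alpha is hypothetical and not part of the thread or of Encode.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: takes raw bytes plus the name of their encoding
# and reports whether the decoded text contains an alphabetic character.
sub has_alpha {
    my ($bytes, $encoding) = @_;
    my $chars = decode($encoding, $bytes);   # bytes -> Perl character string
    return $chars =~ /\p{IsAlpha}/ ? 1 : 0;
}

# Build the Shift-JIS bytes for U+5F4C the same way the example above does.
my $sjis_bytes = encode("shiftjis", "\x{5F4C}");

# Matching the undecoded bytes: Perl reads them as Latin1 code points.
my $raw_result     = $sjis_bytes =~ /\p{IsAlpha}/ ? 1 : 0;
# Matching after decoding: Perl sees the single character U+5F4C.
my $decoded_result = has_alpha($sjis_bytes, "shiftjis");

print "matched as raw bytes: $raw_result\n";
print "matched after decode: $decoded_result\n";
```

Decoding at the input boundary (or via an I/O layer on the filehandle) keeps every later regex working on characters rather than on bytes of an unknown encoding.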
Re^2: Why [[:alpha:]] doesn't involve diacritic characters in replace expression? by wk (Novice) on Oct 22, 2008 at 16:02 UTC
Thank you, JavaFan! That's it. I had looked at those Unicode properties and thought: no way this could help me, it's the same story... :)) And yes, the problem affects diacritics that are in the Latin1 charset, but not those above it. For example, 'š' worked great.

Thank you for the solution!

Kõike hääd. WK
Re: Why [[:alpha:]] doesn't involve diacritic characters in replace expression? by Tanktalus (Canon) on Oct 22, 2008 at 22:48 UTC
How about if we use the character class for this?

$str =~ s/[^[:alpha:]]//g; #i isn't needed

I think what you found is either a bug in perl, or a bug in the documentation. `[[:^alpha:]]` and `[^[:alpha:]]` should work the same by my reading of perlre. But, apparently, they don't.
Re^2: Why [[:alpha:]] doesn't involve diacritic characters in replace expression? by graff (Chancellor) on Oct 25, 2008 at 17:32 UTC
Thank you (++) for pointing that out, Tanktalus. Your post prompted me to do an experiment, comparing the POSIX and Unicode-based "alpha" patterns (presented here as a sequence of bash shell commands).

Read more... (5 kB)

The `[[:^blah:]]` syntax is described in perlre as being "a Perl extension" to the POSIX syntax. My experiment shows that this extension, as applied to ":alpha:", creates a "special" class of characters, which match both ":alpha:" and ":^alpha:" -- these happen to be the "Latin1 upper-table" code points that involve letter symbols. (Using the "normal" method for inverting character classes -- `[^[:alpha:]]` -- has the expected behavior of providing the exact complement of `[[:alpha:]]`.)

Maybe this could be viewed as a "feature" of the ":^alpha:" syntax, but only if people know about it. Considering that it isn't explained as such (at all) in the perlre man page -- and since it clearly differs from `[^[:alpha:]]` -- I'd have to say it's more likely to be a bug.

(My experiment used perl 5.8.8 built for darwin.)
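A rough way to repeat this kind of comparison in plain Perl, offered as a sketch rather than a reconstruction of the collapsed experiment: it walks code points 0-255 and lists those where the two negated forms disagree. The helper name negation_mismatches is made up for illustration, and what it reports will depend on the Perl version in use.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the code points in 0..255 for which the
# two negated forms give different answers. $upgrade selects whether the
# test character is stored with the UTF-8 flag on.
sub negation_mismatches {
    my ($upgrade) = @_;
    my @mismatches;
    for my $cp (0 .. 255) {
        my $char = chr $cp;
        utf8::upgrade($char) if $upgrade;
        my $perl_ext = $char =~ /[[:^alpha:]]/ ? 1 : 0;   # Perl extension
        my $bracket  = $char =~ /[^[:alpha:]]/ ? 1 : 0;   # ordinary negation
        push @mismatches, $cp if $perl_ext != $bracket;
    }
    return @mismatches;
}

for my $upgrade (0, 1) {
    my @found = negation_mismatches($upgrade);
    my $list  = join " ", map { sprintf "U+%04X", $_ } @found;
    printf "UTF-8 flag %s: %d mismatch(es) %s\n",
        ($upgrade ? "on" : "off"), scalar(@found), $list;
}
```

An empty list for both flag states would mean the two spellings are exact equivalents on that perl; any listed code points are the ones where `[[:^alpha:]]` and `[^[:alpha:]]` part ways.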
Re: Why [[:alpha:]] doesn't involve diacritic characters in replace expression? by Fletch (Bishop) on Oct 22, 2008 at 14:49 UTC
Not completely sure but I believe that in order to get `[:alpha:]` to grok accented characters you also need to use the locale pragma (in addition to utf8 as you've done).

The cake is a lie. The cake is a lie. The cake is a lie.
Re^2: Why [[:alpha:]] doesn't involve diacritic characters in replace expression? by wk (Novice) on Oct 22, 2008 at 15:48 UTC
I use the locale pragma in my scripts, but it would be meaningless in your environment, so I left it out of my example script.

Kõike hääd. WK