Obviously, if I want a character class that allows only alphabetical characters, that's easy to construct with "[a-zA-Z]". But that approach doesn't hold up when programming with locales or Unicode. In that environment, \w automagically includes accented "word" characters. So it is difficult to construct a locale-portable character class representation of "word" characters that excludes digits and the underscore.
It seems to me that the \w character class definition is too broad. It would be easier to work with a more narrowly defined character class. For example, let's say there's a new character class called \a, which represents alpha characters only. If one wanted to extend this imaginary character class of alpha characters to also include numeric characters, it would be sufficient to say [\d\a]. Yet it's difficult to subtract items from predefined character classes. You can't say [\w\D] if what you intend is "word characters minus numeric characters": inside brackets the two classes are combined rather than subtracted, so [\w\D] ends up matching any character at all.
I'm curious as to why \w includes numeric and underscore characters. I'm also curious as to what would constitute a locale-friendly alpha-only character class (one that excludes numeric digits).
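For what it's worth, one idiom that seems to get close to "word characters minus digits and underscore" is a double negative: a negated class that lists \W, \d and _ together. Below is a minimal sketch in Python, chosen only because its re module also treats \w as Unicode-aware for text patterns. The names ALPHA_ONLY and alpha_runs are made up for illustration, and the result is only approximately "letters": a few non-decimal numerics such as superscript digits may still get through.

```python
import re

# Double negative: "anything that is NOT (a non-word char, a digit, or _)".
# What remains is \w minus \d minus underscore, i.e. roughly "letters".
ALPHA_ONLY = re.compile(r"[^\W\d_]+")

def alpha_runs(text):
    """Return the maximal runs of letter-like characters in text."""
    return ALPHA_ONLY.findall(text)

# By contrast, [\w\D] is a union, so even a plain digit matches it.
union_matches_digit = re.fullmatch(r"[\w\D]", "7") is not None

if __name__ == "__main__":
    # Accented letters are written as escapes to avoid encoding surprises.
    print(alpha_runs("caf\u00e9_42 na\u00efve x9y"))
    print(union_matches_digit)
```

The same bracket expression, [^\W\d_], should carry over to other regex flavors where \w follows the locale or Unicode, though that is worth checking against the flavor in use.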
...seeking enlightenment...
Dave
In reply to Why does \w include numbers and underscore? by davido