I'm sure there is a good historical reason why the character class \w (used in regular expressions) includes, under ASCII, not only A..Z and a..z but also 0..9 and the underscore character (_). I understand that at this point there's no going back to define \w more narrowly without breaking billions of lines of code already out there. But I have often wondered why \w was implemented this way in the first place.

Obviously if I want a character class that allows only alphabetical characters, that's easy to construct with "[a-zA-Z]". But that approach isn't as effective when programming with locales using Unicode. In that environment, \w automagically includes accented "word" characters. So it is difficult to construct a locale-portable character class representation of "word" characters that excludes numerics and underscore.
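To make the difference concrete, here is a minimal sketch in Python (chosen only for illustration; other regex flavours differ in detail, and the sample string and variable names are made up). It assumes Python 3, where \w on a str pattern is Unicode-aware by default.

```python
import re

# Hypothetical sample text: "naïve café_42", written with escapes so the
# accented letters are unambiguous single code points.
text = "na\u00efve caf\u00e9_42"

# ASCII-only range: the accented letters are silently dropped.
ascii_alpha = "".join(re.findall(r"[a-zA-Z]", text))

# \w keeps the accented letters, but the digits and the underscore
# come along with them.
word_chars = "".join(re.findall(r"\w", text))

print(ascii_alpha)
print(word_chars)
```

Neither result is the "letters only, any language" set that I'm after: the first loses ï and é, the second gains _, 4 and 2.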

It seems to me that the \w character class definition is too broad. It would be easier to work with a more narrowly defined character class. For example, let's say there's a new character class called \a, which represents alpha characters only. If one wanted to extend this imaginary character class of alpha characters to also include numeric characters, it would be sufficient to say [\d\a]. Yet it's difficult to subtract items from predefined character classes. You can't say [\w\D] if what you intend is "word characters minus numeric characters", because everything listed inside the brackets is added together: [\w\D] matches word characters plus all non-digits, which is every character.
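A small sketch of the subtraction problem, again in Python purely for illustration (the sample string and names are hypothetical). It also shows one workaround that is sometimes suggested: negate the class and list the complements, so that [^\W\d_] reads as "not a non-word character, not a digit, not an underscore". I'm not claiming this is the canonical answer, only that it behaves as shown on this sample.

```python
import re

# Hypothetical sample text: "été_2024 résumé"
sample = "\u00e9t\u00e9_2024 r\u00e9sum\u00e9"

# A bracketed class is a union, so [\w\D] matches every character:
# each one is either a non-digit (\D) or a digit (already in \w).
union = re.compile(r"[\w\D]")

# Double negation gets the effect of subtraction:
# word characters, minus digits, minus the underscore.
alpha_only = re.compile(r"[^\W\d_]")

matched_by_union = "".join(union.findall(sample))
letters = "".join(alpha_only.findall(sample))

print(matched_by_union == sample)
print(letters)
```

One caveat I'm aware of: in Python \d covers decimal digits only, so other numeric characters that count as word characters (superscript digits, for instance) may still get through, and other engines define \w and \d a little differently.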

I'm curious as to why \w includes numeric and underscore characters. I'm also curious as to what would constitute a locale-friendly alpha-only character class (one that excludes numeric digits).

...seeking enlightenment...

Dave

In reply to Why does \w include numbers and underscore? by davido
