in reply to What Is A Word?
If you try to grab words with a regex, the table of Unicode properties in perlunicode can be very helpful. For example, in many languages words consist of letters, which can be matched with \pL or \p{Letter}, and marks (\pM or \p{Mark}).
(Marks are combining characters that can modify letters or other characters. A well-known example is the "combining grave accent", which turns an A into an À.)
Update: To clarify this further: in this context the one thing that Unicode buys you is that you don't have to enumerate characters to build your character classes. That's a cumbersome task, and it's usually done wrong because the set of characters is huge. It doesn't help you decide what you consider a word.
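To make the letters-plus-marks idea concrete, here is a minimal sketch. It assumes that a "word" is a letter followed by any run of letters and marks, which is only one possible definition; the helper name words_in and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every run that starts with a letter (\pL)
# and continues with letters or combining marks (\pM).
sub words_in {
    my ($text) = @_;
    return $text =~ /(\pL[\pL\pM]*)/g;
}

# "A\x{300}" is a plain A followed by U+0300 COMBINING GRAVE ACCENT.
my $text  = "voil\x{E0} A\x{300} la carte, na\x{EF}ve caf\x{E9}s";
my @words = words_in($text);

binmode(STDOUT, ':encoding(UTF-8)');
print "$_\n" for @words;
```

Without \pM in the class, the combining accent would be cut off from its base letter and the second word would come back as a bare A. Whether digits, apostrophes or hyphens belong in a word is still up to you; the properties only save you from listing the characters by hand.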
Re^2: What Is A Word?
by Limbic~Region (Chancellor) on Jan 22, 2009 at 19:30 UTC