in reply to What Is A Word?

I might add that in human language \w+ is usually both too broad (it matches the underscore _ and digits, which usually don't appear in words) and too narrow (it stops at the apostrophe in "don't"), and that word recognition is in general language dependent.

If you try to grab words with a regex, the table of Unicode properties in perlunicode can be very helpful. For example, in many languages words consist of letters, which can be matched with \pL or \p{Letter}, and marks, which can be matched with \pM or \p{Mark}.
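As a rough sketch of the difference, the snippet below runs both \w+ and a letters-plus-marks class over the same string. The sample text is made up for illustration; it includes a character above U+00FF so that Perl applies Unicode rules to the whole string.

```perl
use strict;
use warnings;

# Hypothetical sample text: "Dvořák wrote opus_95 in 1893".
# \x{159} is r-caron and \x{e1} is a-acute.
my $text = "Dvo\x{159}\x{e1}k wrote opus_95 in 1893";

# \w+ accepts letters, but also digits and the underscore.
my @w_words = $text =~ /(\w+)/g;

# Letters and combining marks only: digits and underscores split or drop tokens.
my @letter_words = $text =~ /([\pL\pM]+)/g;

binmode STDOUT, ':encoding(UTF-8)';
print join('|', @w_words), "\n";
print join('|', @letter_words), "\n";
```

Note that the second pattern is not automatically "right": it splits opus_95 into just opus, which may or may not be what you want.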

(Marks are combining characters that can modify letters or other characters. A well-known example is the "combining grave accent", which turns an A into an À.)
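To illustrate why the marks belong in the character class, here is a small sketch using a decomposed À (the letter A followed by U+0300). The variable names are only for this example; Unicode::Normalize is a core module.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# LATIN CAPITAL LETTER A followed by COMBINING GRAVE ACCENT (two code points).
my $decomposed = "A\x{300}";

# \pL+ alone stops before the combining mark and captures only the "A".
my ($letters_only) = $decomposed =~ /(\pL+)/;

# Adding \pM keeps the accent attached to its base letter.
my ($with_marks) = $decomposed =~ /([\pL\pM]+)/;

# NFC normalization composes the pair into the single code point U+00C0.
my $composed = NFC($decomposed);

printf "letters only: %d code point(s)\n", length $letters_only;
printf "with marks:   %d code point(s)\n", length $with_marks;
printf "after NFC:    %d code point(s)\n", length $composed;
```

Normalizing to NFC first reduces how often marks show up separately, but not every letter-and-mark combination has a precomposed form, so \pM is still worth including.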

Update: To clarify this further: in this context the one thing that Unicode buys you is that you don't have to enumerate characters to build your character classes. That's a cumbersome task, and usually done wrong because there's a huge set of characters. It doesn't help you with your mental decision of what you consider a word.

Re^2: What Is A Word?
by Limbic~Region (Chancellor) on Jan 22, 2009 at 19:30 UTC
    moritz,
    I might add that in human language \w+ is usually both too broad (it matches the underscore _ and digits, which usually don't appear in words) and too narrow (it stops at the apostrophe in "don't"), and that word recognition is in general language dependent.

    I already mentioned that in different words - do you think it needs to be more clear?

    The seeker may want to define their own character class, perhaps removing _ and 0-9 from \w but adding the apostrophe and hyphen, so that it matches words like "don't" and "president-elect" and does not match strings like "th_500X". It is worth pointing out that such a class will still match "Z-''-Z".
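One possible way to write such a class is sketched below. The names $word_re and is_word are invented for this example, and [^\W\d_] is just one idiom for "\w without digits and underscore"; treat it as an illustration, not a recommendation.

```perl
use strict;
use warnings;

# A "word" character here is \w minus digits and underscore,
# plus the apostrophe and the hyphen.
my $word_re = qr/(?:[^\W\d_]|['-])+/;

# Hypothetical helper: true if the whole string consists of word characters.
sub is_word {
    my ($candidate) = @_;
    return $candidate =~ /\A$word_re\z/ ? 1 : 0;
}

for my $candidate ("don't", "president-elect", "th_500X", "Z-''-Z") {
    printf "%-16s %s\n", $candidate, is_word($candidate) ? 'word' : 'not a word';
}
```

The last candidate is accepted, which shows that the class only restricts which characters may appear, not how they may be arranged.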

    Regarding the Unicode comment: I mentioned encoding and dealing with foreign languages in passing, because the same pitfalls still apply. If you only want "real" words found in some dictionary, properly handling Unicode by itself is not going to fix the problem of "aaaaaa" and its grave-accent counterpart failing that test.

    In other words, I am saying that what the seeker may want is very subjective and only they can answer the questions necessary to provide an adequate solution. Forgetting to mention encoding will complicate the problem but knowing about it won't necessarily make the problem go away - the entire picture is required.

    Thank you for your comment and the Unicode link - a good tool indeed.

    Cheers - L~R