Definition

It is common for seekers of wisdom to ask seemingly innocuous questions like "How can I find the longest word in a string?" When the seeker is posed the counter-question "What is your definition of a word?", they tend to react as if it were the most obvious thing in the world and you were intentionally being unhelpful. The problem is that they instinctively know how to recognize a word but don't realize the difficulty of lexing "words" from a string using code. In order to help, you will need to enlighten the seeker with some examples:

Pitfalls

The seeker points out that Perl has a character class just for this purpose, \w, and indicates that it is their definition:

my $str = q{The president-elect is a "tough guy", he thinks the rulez don't apply to th_500X.};
my @words = $str =~ /(\w+)/g;
__DATA__
president-elect vs president elect
tough guy vs "tough guy"
rulez (typo?)
don t vs don't
th_500X (nonsensical word)
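It can help to actually run \w+ over that sentence and look at what comes back. The following is only a minimal sketch: the q{} quoting is my choice (so the apostrophe in "don't" does not end the string), and the variable names are illustrative.

```perl
use strict;
use warnings;

# q{} lets the string contain both an apostrophe and double quotes.
my $str = q{The president-elect is a "tough guy", he thinks the rulez don't apply to th_500X.};

# A /g match in list context returns every run of \w characters.
my @words = $str =~ /(\w+)/g;

print join('|', @words), "\n";
# prints: The|president|elect|is|a|tough|guy|he|thinks|the|rulez|don|t|apply|to|th_500X
```

Note that "president-elect" and "don't" are each split in two, while "th_500X" survives intact.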

You will have to point out the pitfalls of any simplistic approach.

If $str is formatted text, it is possible that a word may be split across newlines or even page breaks. If the seeker is looking for a specific "word", you have to point out that if their word is "car" then they probably don't want to match "scared". The seeker may want to define their own character class - perhaps to remove _ and 0-9 from \w but to add apostrophe and hyphen to match words like "don't" and "president-elect" and to not match words like "th_500X". You will need to point out that they are still going to match "Z-''-Z".
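Both points can be sketched in a few lines. The character class below is a hypothetical one (ASCII letters plus apostrophe and hyphen), not a recommendation, and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical custom "word" class: ASCII letters, apostrophe and hyphen.
# The hyphen is last in the class so it is taken literally.
my $word_re = qr/[A-Za-z'-]+/;

my $str   = q{They don't trust the president-elect or th_500X, but Z-''-Z slips through.};
my @words = $str =~ /($word_re)/g;
print join('|', @words), "\n";
# prints: They|don't|trust|the|president-elect|or|th|X|but|Z-''-Z|slips|through

# Searching for a specific word: without anchors "car" also matches inside "scared".
my $text   = 'I was scared of the car';
my @loose  = $text =~ /(car)/g;       # two matches
my @strict = $text =~ /\b(car)\b/g;   # one match
print scalar(@loose), ' vs ', scalar(@strict), "\n";
# prints: 2 vs 1
```

Notice that "th_500X" is no longer matched as a whole, but its fragments "th" and "X" still come back as "words", and "Z-''-Z" passes untouched. Also keep in mind that \b is itself defined in terms of \w, so it inherits the same assumptions.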

Perhaps $str is the result of optical character recognition (OCR) and they will need to decide if "Hi there, my name1s T0m" should treat "name1s" as a unit, if it should extract up to the 1, if it should try to split that into two tokens, etc. Dealing with punctuation, foreign words, markup, encoding, etc. can all throw additional monkey wrenches into the works.
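The three choices for the OCR string can be written down as three different regexes. This is just a sketch of the decision the seeker has to make; the variable names and the exact patterns are my own assumptions, and none of them is "the" right answer.

```perl
use strict;
use warnings;

my $ocr = 'Hi there, my name1s T0m';

# Policy 1: letters and digits together form one unit.
my @as_unit = $ocr =~ /([A-Za-z0-9]+)/g;
print join('|', @as_unit), "\n";   # prints: Hi|there|my|name1s|T0m

# Policy 2: a digit splits the chunk into separate tokens.
my @split = $ocr =~ /([A-Za-z]+)/g;
print join('|', @split), "\n";     # prints: Hi|there|my|name|s|T|m

# Policy 3: keep only the leading letters of each whitespace-separated chunk.
my @prefix = map { /^([A-Za-z]+)/ ? $1 : () } split ' ', $ocr;
print join('|', @prefix), "\n";    # prints: Hi|there|my|name|T
```

Same input, three different word lists, and so potentially three different answers to "what is the longest word?".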

Use A Tool

At this point, it should be obvious to the seeker that defining a word and lexing it are two difficult tasks. They may profess to know their input and say they don't really need to consider all of these edge cases. Hopefully that is the case and they will be able to use one of the simplistic approaches. More realistically, they can accept a margin of error for edge cases and the simple approach is still sufficient. Unfortunately, the situation may be analogous to someone asking how to parse HTML or XML with a regex. Sometimes it is ok but you would probably save yourself a headache if you just used a tool.

A quick search of the CPAN for "parse words" reveals Text::ParseWords and Text::Balanced. Of course, if you have to write your own lexer there is Parse::RecDescent, Parse::Yapp and friends. Using an external dictionary may be of help but it can also be a double-edged sword. Unfortunately, I am unaware of a one-size-fits-all silver bullet. The reason for using a tool (preferably an extensible one) is so that you can easily add, remove, or modify "rules" to fit your needs so the code is reusable - even if you have to build the tool yourself.
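As a small taste of what such a tool does, here is a sketch using quotewords from Text::ParseWords. The input line is made up, and bear in mind that this module's notion of a "word" is shell-like (delimiters plus quote grouping), which may or may not be the seeker's definition.

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

my $line = q{The "tough guy" thinks};

# Split on whitespace, treat quoted text as a single token,
# and (second argument 0) drop the quote characters themselves.
my @tokens = quotewords('\s+', 0, $line);

print join('|', @tokens), "\n";
# prints: The|tough guy|thinks
```

Here "tough guy" comes back as one token, which answers the "tough guy vs "tough guy"" question one particular way. Apostrophes are quote characters to this module too, so text such as "don't" needs care before you hand it over.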

Note

The reason I wrote this is so that when a seeker asks a question in the future, we have a node to point to - much like jdporter did with XY Problem. I wrote this as a meditation and not a tutorial because there was no instruction provided, just things to think about. If you have more to add, I welcome your thoughts.

Update 2009-01-23: It was pointed out to me in a private /msg that the seeker isn't always the one that has the definition. They should be encouraged to seek counsel from their boss, teacher, requirements author, etc. rather than come up with their own definition if appropriate.

Cheers - L~R

Re: What Is A Word?
by moritz (Cardinal) on Jan 22, 2009 at 16:06 UTC
    I might add that in human language \w+ is usually both too broad (because it contains the underscore _ and digits, which usually don't appear in words in human language) and too narrow (don't), and in general word recognition is language dependent.

    If you try to grab words with a regex, the table of Unicode properties in perlunicode can be very helpful. For example in many languages words consist of letters which can be matched with \pL or \p{Letter}, and marks (\pM or \p{Mark}).

    (Marks are combining characters that can modify letters or other characters. A well-known example is the "combining grave accent", which turns an A into an À.)
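    A minimal sketch of matching letters plus marks follows. The sample string is made up, and the accented characters are written as \x{...} escapes purely so the example does not depend on how the source file is encoded.

```perl
use strict;
use warnings;

# "A" followed by U+0300 COMBINING GRAVE ACCENT, then some French text;
# U+00EE is a precomposed i with circumflex.
my $str = "A\x{0300} la carte, s'il vous pla\x{00EE}t";

# A "word" here is a run of letters (\pL) and combining marks (\pM).
my @words = $str =~ /([\pL\pM]+)/g;

print scalar(@words), " tokens; first token has ", length($words[0]), " characters\n";
# prints: 7 tokens; first token has 2 characters
```

    The letter and its combining accent stay together in one token, but "s'il" is still split at the apostrophe: the properties save you from enumerating characters, not from deciding what a word is.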

    Update: To clarify this further: in this context the one thing that Unicode buys you is that you don't have to enumerate characters to build your character classes. That's a cumbersome task, and usually done wrong because there's a huge set of characters. It doesn't help you with your mental decision of what you consider a word.

      moritz,
      I might add that in human language \w+ is usually both too broad (because it contains the underscore _ and digits, which usually don't appear in words in human language) and too narrow (don't), and in general word recognition is language dependent.

      I already mentioned that in different words - do you think it needs to be more clear?

      The seeker may want to define their own character class - perhaps to remove _ and 0-9 from \w but to add apostrophe and hyphen to match words like "don't" and "president-elect" and to not match words like "th_500X". You will need to point out that they are still going to match "Z-''-Z".

      Regarding the Unicode comment: I mentioned encoding and dealing with foreign languages in passing. The reason is that the same pitfalls still happen. If you only want "real" words in some dictionary - properly handling Unicode by itself is not going to fix the problem of "aaaaaa" and the grave accent counterpart failing.

      In other words, I am saying that what the seeker may want is very subjective and only they can answer the questions necessary to provide an adequate solution. Forgetting to mention encoding will complicate the problem but knowing about it won't necessarily make the problem go away - the entire picture is required.

      Thank you for your comment and the Unicode link - a good tool indeed.

      Cheers - L~R