It is common for seekers of wisdom to ask seemingly innocuous questions like "How can I find the longest word in a string?". When the seeker is posed the counter question "What is your definition of a word?", they tend to react as if the answer is the most obvious thing in the world and you are intentionally being unhelpful. The problem is that they instinctively know how to recognize a word but don't realize how difficult it is to lex "words" from a string using code. In order to help, you will need to enlighten the seeker with some examples:
The seeker points out that Perl has a character class just for this purpose, \w, and indicates that this is their definition.
my $str = q{The president-elect is a "tough guy", he thinks the rulez don't apply to th_500X.};
my @words = $str =~ /(\w+)/g;
__DATA__
president-elect vs president elect
tough guy vs "tough guy"
rulez (typo?)
don t vs don't
th_500X (nonsensical word)
You will have to point out the pitfalls of any simplistic approach.
If the $str is formatted text, it is possible that a word may be split across newlines or even page breaks. If the seeker is looking for a specific "word", you have to point out that if their word is "car" then they probably don't want to match "scared". The seeker may want to define their own character class - perhaps removing _ and 0-9 from \w but adding apostrophe and hyphen, in order to match words like "don't" and "president-elect" and to not match words like "th_500X". You will need to point out that they are still going to match "Z-''-Z".
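To make those two pitfalls concrete, here is a minimal sketch. The sample sentences and the character class [A-Za-z'-] are my own assumptions for illustration (I appended "Z-''-Z" to the earlier example string); they are not a recommended definition of a word.

```perl
use strict;
use warnings;

# Pitfall 1: looking for "car" without boundaries also finds "scared".
my $text   = 'The car scared the cat.';
my @loose  = $text =~ /(\w*car\w*)/g;   # any \w run containing "car"
my @strict = $text =~ /\b(car)\b/g;     # "car" only as a whole \w-delimited token

# Pitfall 2: a hand-rolled class (letters, apostrophe, hyphen).
# This class is an illustrative assumption, not a standard.
my $str = q{The president-elect is a "tough guy", he thinks the rulez don't apply to th_500X. Z-''-Z};
my @words = $str =~ /([A-Za-z'-]+)/g;

print join('|', @loose),  "\n";
print join('|', @strict), "\n";
print join('|', @words),  "\n";
```

With this class, "don't" and "president-elect" come out whole, but so does "Z-''-Z". Note also that "th_500X" is not skipped as a unit: the match simply resumes on either side of the excluded characters, so the fragments "th" and "X" show up as "words".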
Perhaps $str is the result of optical character recognition (OCR), and they will need to decide if "Hi there, my name1s T0m" should treat "name1s" as a unit, if it should extract up to the 1, if it should try to split that into two tokens, etc. Dealing with punctuation, foreign words, markup, encoding, etc. can all throw additional monkey wrenches into the works.
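The three choices for "name1s" can be sketched roughly as follows. The variable names and patterns are hypothetical, and none of the three is "the right one"; which to pick depends on what the seeker actually needs.

```perl
use strict;
use warnings;

my $ocr = 'Hi there, my name1s T0m';

# Option A: treat a run of letters and digits as one unit.
my @units = $ocr =~ /([A-Za-z0-9]+)/g;

# Option B: for each whitespace-separated chunk, extract only up to
# the first non-letter.
my @prefixes;
for my $chunk (split ' ', $ocr) {
    push @prefixes, $1 if $chunk =~ /^([A-Za-z]+)/;
}

# Option C: treat the digit as a separator, giving two tokens.
my @pieces = $ocr =~ /([A-Za-z]+)/g;

print "A: @units\n";
print "B: @prefixes\n";
print "C: @pieces\n";
```

Option A keeps "name1s" and "T0m" intact, option B yields "name" and "T", and option C yields "name", "s", "T" and "m". None of them recovers "name is Tom", which is the point: the lexer cannot fix what the definition does not cover.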
At this point, it should be obvious to the seeker that defining a word and lexing it are two difficult tasks. They may profess to know their input and claim they don't really need to consider all of these edge cases. Hopefully that is the case and they will be able to use one of the simplistic approaches. More realistically, they can accept a margin of error for edge cases and the simple approach is still sufficient. Unfortunately, the situation may be analogous to someone asking how to parse HTML or XML with a regex. Sometimes it is ok, but you would probably save yourself a headache if you just used a tool.
A quick search of the CPAN for "parse words" reveals Text::ParseWords and Text::Balanced. Of course, if you have to write your own lexer there is Parse::RecDescent, Parse::Yapp and friends. Using an external dictionary may be of help but it can also be a double-edged sword. Unfortunately, I am unaware of a one-size-fits-all silver bullet. The reason for using a tool (preferably an extensible one) is so that you can easily add, remove, or modify "rules" to fit your needs so the code is reusable - even if you have to build the tool yourself.
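If you do end up building the tool yourself, one way to keep the "rules" easy to add, remove, or modify is an ordered rule table. What follows is only a sketch under my own assumptions: the rule names, the patterns, and the tokenize function are hypothetical and are not taken from any of the modules above.

```perl
use strict;
use warnings;

# Ordered rule table: the first rule that matches at the current
# position wins. Names and patterns are illustrative assumptions.
my @rules = (
    [ word   => qr/[A-Za-z]+(?:['-][A-Za-z]+)*/ ],  # letters, with single internal ' or -
    [ number => qr/\d+/ ],
    [ space  => qr/\s+/ ],
    [ other  => qr/./s ],                           # catch-all, one character
);

sub tokenize {
    my ($text) = @_;
    my @tokens;
    pos($text) = 0;
    TOKEN: while (pos($text) < length $text) {
        for my $rule (@rules) {
            my ($type, $re) = @$rule;
            # \G anchors at the current position; /c keeps pos() on failure.
            if ($text =~ /\G($re)/gc) {
                push @tokens, [ $type, $1 ];
                next TOKEN;
            }
        }
        last;   # not reached while the catch-all rule is present
    }
    return @tokens;
}

my @words = map { $_->[1] } grep { $_->[0] eq 'word' }
            tokenize(q{don't trust Z-''-Z, president-elect});
print join('|', @words), "\n";
```

With these particular rules, "don't" and "president-elect" survive as single words while "Z-''-Z" falls apart into two "Z" tokens plus punctuation. Changing the definition of a word then means editing one line of the table rather than rewriting the loop.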
The reason I wrote this is so that when a seeker asks a question in the future, we have a node to point to - much like jdporter did with XY Problem. I wrote this as a meditation and not a tutorial because there was no instruction provided, just things to think about. If you have more to add, I welcome your thoughts.
Update 2009-01-23: It was pointed out to me in a private /msg that the seeker isn't always the one who has the definition. Where appropriate, they should be encouraged to seek counsel from their boss, teacher, requirements author, etc. rather than come up with their own definition.
Cheers - L~R