Before you write the code, you should decide what counts as a "word". This is actually a fairly complex question.
I would approach it as something like:
A "word" is a contiguous sequence of letters or digits with zero or more hyphens or single quotes inside it. (There are words that begin or end with single quotes, but it isn't possible to easily distinguish those from quoted words.)
With a definition like that, you can pick them out more easily with something like @words = m/((?:\w|\b[-']\b){4,})/g than by splitting on non-word characters.
One problem with that is that \w will include _. To avoid that, use [^\W_] instead of \w and (?<=[^\W_])[-'](?=[^\W_]) instead of the other part (untested).
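Here is a minimal sketch of that underscore-free variant, just to show the shape of it. The sample sentence and the names $text, $wc and @words are made up for illustration, and the four-character minimum is carried over from the pattern above.

```perl
use strict;
use warnings;

# Hypothetical sample text; not from the original thread.
my $text = "The well-known author's note said don't use snake_case or 'quoted' words.";

# A single word character that is not an underscore.
my $wc = qr/[^\W_]/;

# Runs of four or more units, where a unit is either a letter/digit or a
# hyphen/apostrophe with a letter/digit on both sides.
my @words = $text =~ /((?:$wc|(?<=$wc)[-'](?=$wc)){4,})/g;

print "$_\n" for @words;
```

With this input, "well-known", "author's" and "don't" stay whole, the quotes around 'quoted' are dropped, and "snake_case" comes out as two separate words because the underscore no longer counts as a word character.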
Another problem is accented characters. If you want to include them, either utf8::decode($big_string) before using it (if your input is UTF-8) or use locale (if it is in a single-byte encoding that matches your locale).
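A small sketch of the UTF-8 case, assuming the input arrives as raw UTF-8 bytes. The sample bytes are made up, and the simplified pattern here ignores hyphens and quotes to keep the focus on the accented letters. Once the bytes are decoded into characters, the accented letters match as word characters.

```perl
use strict;
use warnings;

# Hypothetical input: the UTF-8 bytes for "café naïve".
my $big_string = "caf\xc3\xa9 na\xc3\xafve";

# Convert UTF-8 bytes to characters in place; returns false if the
# bytes are not valid UTF-8.
utf8::decode($big_string) or die "input is not valid UTF-8";

# Letters and digits only, four or more in a row.
my @words = $big_string =~ /([^\W_]{4,})/g;

# Encode on the way out so the accented characters print correctly.
binmode(STDOUT, ':encoding(UTF-8)');
print "$_\n" for @words;
```

Without the decode step, the two bytes of each accented letter are not treated as word characters, so neither word reaches the four-character minimum.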
And if your input has words split over lines with a hyphen, you may want to s/-\n//g your input.
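As a quick illustration of that substitution (the sample text and the name $joined are made up):

```perl
use strict;
use warnings;

# Hypothetical input where "tutelage" was broken across two lines.
my $big_string = "A long tute-\nlage session on well-known topics.\n";

# Copy the input, then delete every hyphen that is immediately followed
# by a newline, which joins the two halves of the word.
(my $joined = $big_string) =~ s/-\n//g;

print $joined;
```

Note that this also removes the hyphen from a genuinely hyphenated word that happens to break at its hyphen, so "well-\nknown" would become "wellknown". Whether that matters depends on your input.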
If you have test input, you really ought to print out what you are getting as words and compare it to the input to see what special cases you may be missing.
In reply to Re: Tutelage, part two
by ysth
in thread Tutelage, part two
by ctp