Of course, you need to be a bit careful about what you want to view as word boundaries -- consider:
    $_ = "Hello there,neighbour";
    @words = ( /(\w+)/g );
    print "Words are: @words\n";
Hyphen/dash and apostrophe/single-quote are notoriously ambiguous in English text, making it pretty tricky to do "coherent" word tokenization (and sometimes periods are tough as well). Perl's "\w" never matches them, and "\b" always sees a word boundary next to them, even when humans would not. If the data you'll be handling doesn't have any ambiguous cases of these characters, you're just very lucky. If it does, a slightly more complicated regex is needed -- something like:
    $_ = "We're done. Mr. O'Conner sent e-mail - said 'no thanks'.";
    @words = ( /(\w+(?:[-']\w+)?)/g );
That is, one or more "word-like" characters (actually, letters, digits and/or underscores), optionally followed by a hyphen or apostrophe, so long as there are more word-like characters immediately after that.
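
To see what the optional group buys you, here is a small sketch that runs both patterns over that sample sentence. The variable names ($text, @plain, @joined) are mine, just for illustration:

```perl
use strict;
use warnings;

my $text = "We're done. Mr. O'Conner sent e-mail - said 'no thanks'.";

# Plain \w+ splits at every apostrophe and hyphen.
my @plain = $text =~ /(\w+)/g;

# Allow one internal hyphen or apostrophe, but only when more
# word characters follow it immediately.
my @joined = $text =~ /(\w+(?:[-']\w+)?)/g;

print "Plain:  @plain\n";
print "Joined: @joined\n";
```

The first list should come out as 12 pieces ("We", "re", ... "e", "mail", ...), the second as 9, with "We're", "O'Conner" and "e-mail" kept whole. The free-standing dash and the quotes around 'no thanks' are dropped, because they don't have word characters on both sides.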
But even with that, some people agonize over the different uses of hyphens; most would agree that something like "re-edit" should be one "word", but what about "kick-in-the-pants hot sauce" -- how many words there?
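
If you decide that a multiply-hyphenated compound should count as one "word" (that's a judgment call, not something the regex can settle for you), one possible tweak is to change the "?" after the group to "*", so the hyphen-plus-letters part can repeat. A minimal sketch, with variable names made up for the example:

```perl
use strict;
use warnings;

my $text = "kick-in-the-pants hot sauce";

# "?" permits at most one internal hyphen/apostrophe per token,
# so a long compound gets cut into two-part pieces.
my @one_join = $text =~ /(\w+(?:[-']\w+)?)/g;

# "*" permits any number of internal hyphens/apostrophes.
my @any_joins = $text =~ /(\w+(?:[-']\w+)*)/g;

print scalar(@one_join),  " tokens: @one_join\n";
print scalar(@any_joins), " tokens: @any_joins\n";
```

With "?" this should give four tokens ("kick-in", "the-pants", "hot", "sauce"); with "*" it gives three, with "kick-in-the-pants" as a single token.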
Oh, and if you don't want to match "words" containing digits or underscores, just replace "\w" with [a-z] and add an "i" modifier at the end (next to the "g"), to make it case-insensitive.
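
A quick sketch of that substitution, on a made-up sample string:

```perl
use strict;
use warnings;

my $text = "Route 66 has foo_bar and e-mail";

# [a-z] with /i matches ASCII letters only; digits and underscores
# now act as separators, just like spaces and punctuation.
my @words = $text =~ /([a-z]+(?:[-'][a-z]+)?)/gi;

print "Words are: @words\n";
```

Note what this does and doesn't do: "66" vanishes entirely, but "foo_bar" is not skipped, it is split into "foo" and "bar". If you'd rather throw away any token that contains a digit or underscore, you would need to match with "\w" first and filter the list afterwards.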
(update: And then there's the problem of accented characters... like é etc; if you have to go there, check out the perlunicode man page -- make sure your text is in utf8, and use \p{Letter} in place of "\w" or [a-z].)
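
For the accented-character case, a minimal sketch might look like the following. It assumes the script file itself is saved as UTF-8 (hence "use utf8"); text read from a file or socket would need to be decoded first, for example by opening the handle with an ":encoding(UTF-8)" layer. The sample string is made up.

```perl
use strict;
use warnings;
use utf8;                              # string literals below are UTF-8
binmode(STDOUT, ':encoding(UTF-8)');   # encode output on the way out

my $text = "Le café d'été est fermé";

# \p{Letter} matches any Unicode letter, accented or not,
# but not digits or underscores.
my @words = $text =~ /(\p{Letter}+(?:[-']\p{Letter}+)?)/g;

print "Words are: @words\n";
```

Here "café", "d'été" and "fermé" should each come through as one token, where [a-z] would have chopped them at the "é".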
In reply to Re: Match non-whitespace characters by graff, in thread Match non-whitespace characters by biofeng918