Match non-whitespace characters

biofeng918 has asked for the wisdom of the Perl Monks concerning the following question:

Re: Match non-whitespace characters by erix (Prior) on Oct 31, 2004 at 11:55 UTC
I think you want to change those \s to \S: \s matches whitespace, \S matches non-whitespace. You also need a space after the comma: "Hello there, neighbour". The $_ as you posted it won't match otherwise, even with \S.
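To make both points concrete, here is a minimal sketch. The original pattern is not shown in this thread, so the regex below is a hypothetical stand-in (three non-whitespace runs separated by whitespace), not the poster's actual code.

```perl
use strict;
use warnings;

# Hypothetical pattern (an assumption, not the original poster's regex):
# three runs of non-whitespace separated by runs of whitespace.
my $re = qr/^(\S+)\s+(\S+)\s+(\S+)$/;

my $no_space   = "Hello there,neighbour";    # no space after the comma
my $with_space = "Hello there, neighbour";   # space after the comma

# In list context a failed match returns an empty list,
# and a successful one returns the captured groups.
my @a = $no_space   =~ $re;
my @b = $with_space =~ $re;

print "no space:   ", (@a ? join("|", @a) : "no match"), "\n";
print "with space: ", (@b ? join("|", @b) : "no match"), "\n";
```

The first string has only two whitespace-separated chunks, so the pattern cannot find a third; with the space added, the captures are "Hello", "there," and "neighbour" (the comma stays attached, because a comma is non-whitespace).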
Re: Match non-whitespace characters by graff (Chancellor) on Oct 31, 2004 at 14:35 UTC
In addition to the "\s -> \S" issue cited above, you probably want a more flexible regex that will match all of the words, regardless of how many words there are and regardless of what the word boundaries are: `$_ = "Hello there,neighbour"; @words = ( /(\w+)/g ); print "Words are: @words\n";`
Of course, you need to be a bit careful about what you want to view as word boundaries -- consider: `$_ = "We're done. Mr. O'Conner sent e-mail - said 'no thanks'.";`
Hyphen/dash and apostrophe/single-quote are notoriously ambiguous in English text, making it pretty tricky to do "coherent" word tokenization (and sometimes periods are tough as well). Perl's "\w" and "\b" expressions will always treat them as word boundaries, even when humans would not. If the data you'll be handling doesn't have any ambiguous cases of these characters, you're just very lucky. If it does, a slightly more complicated regex is needed -- something like: `@words = ( /(\w+(?:[-']\w+)?)/g );`
That is, one or more "word-like" characters (actually letters, digits and/or underscores), optionally followed by a hyphen or apostrophe, so long as there are more word-like characters immediately after that. But even with that, some people agonize over the different uses of hyphens; most would agree that something like "re-edit" should be one "word", but what about "kick-in-the-pants hot sauce" -- how many words there?
Oh, and if you don't want to match "words" containing digits or underscores, just replace "\w" with `[a-z]` and add an "i" modifier at the end (next to the "g") to make it case-insensitive.
(update: And then there's the problem of accented characters, like é; if you have to go there, check out the perlunicode man page -- make sure your text is in utf8, and use `\p{Letter}` in place of "\w" or `[a-z]`.)
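For comparison, here is a small runnable sketch of the two patterns above applied to the sample sentence. The last pattern, which uses `*` instead of `?` so that a chain of hyphens is kept together, is my own variation and only one possible answer to the "kick-in-the-pants" question; the variable names are likewise just for illustration.

```perl
use strict;
use warnings;

my $text = "We're done. Mr. O'Conner sent e-mail - said 'no thanks'.";

# Plain \w+ splits at every apostrophe and hyphen.
my @plain = $text =~ /(\w+)/g;

# Allow one internal hyphen or apostrophe, but only when more
# word characters follow it immediately.
my @joined = $text =~ /(\w+(?:[-']\w+)?)/g;

print join("|", @plain),  "\n";
# We|re|done|Mr|O|Conner|sent|e|mail|said|no|thanks
print join("|", @joined), "\n";
# We're|done|Mr|O'Conner|sent|e-mail|said|no|thanks

# Variation (an assumption, not from the post above): "*" instead of "?"
# keeps any number of hyphen/apostrophe-joined parts in one token.
my $phrase = "kick-in-the-pants hot sauce";
my @once = $phrase =~ /(\w+(?:[-']\w+)?)/g;   # kick-in|the-pants|hot|sauce
my @many = $phrase =~ /(\w+(?:[-']\w+)*)/g;   # kick-in-the-pants|hot|sauce

print join("|", @once), "\n";
print join("|", @many), "\n";
```

Note that the free-standing dash and the quote marks around 'no thanks' are dropped by both patterns, since neither is immediately followed by word characters on one side and preceded by them on the other.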
Re: Match non-whitespace characters by steves (Curate) on Oct 31, 2004 at 11:57 UTC
\s (lower-case 's') matches whitespace; \S (upper-case 'S') matches non-whitespace.