In addition to the "\s -> \S" issue cited above, you probably want a more flexible regex that will match all of the words, regardless of how many words there are, and regardless of what separates them.

$_ = "Hello there,neighbour";

@words = ( /(\w+)/g );

print "Words are: @words\n";

Of course, you need to be a bit careful about what you want to view as word boundaries -- consider:

$_ = "We're done. Mr. O'Conner sent e-mail - said 'no thanks'.";
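
To make the problem concrete, here is a small sketch (not part of the original snippet; the variable names are just for illustration) of what the plain \w+ match does with that sentence:

```perl
use strict;
use warnings;

my $text = "We're done. Mr. O'Conner sent e-mail - said 'no thanks'.";

# \w+ stops at every apostrophe and hyphen, so contractions,
# names and hyphenated words get split into pieces.
my @words = $text =~ /(\w+)/g;

print join('|', @words), "\n";
# We|re|done|Mr|O|Conner|sent|e|mail|said|no|thanks
```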

Hyphen/dash and apostrophe/single-quote are notoriously ambiguous in English text, making it pretty tricky to do "coherent" word tokenization (and sometimes periods are tough as well). Perl's "\w" never matches them, so "\b" will always see a word boundary next to them, even when humans would not. If the data you'll be handling doesn't have any ambiguous cases of these characters, you're just very lucky. If it does, a slightly more complicated regex is needed -- something like:

@words = ( /(\w+(?:[-']\w+)?)/g );

That is, one or more "word-like" characters (actually, letters, digits and/or underscores), optionally followed by a hyphen or apostrophe, so long as there are more word-like characters immediately after that.
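
As a quick check of that description, here is a sketch that runs the pattern over the tricky sentence from earlier (the variable names are only illustrative):

```perl
use strict;
use warnings;

my $text = "We're done. Mr. O'Conner sent e-mail - said 'no thanks'.";

# An apostrophe or hyphen is kept only when word characters follow it,
# so the free-standing dash and the quote marks around 'no thanks' drop out.
my @words = $text =~ /(\w+(?:[-']\w+)?)/g;

print join('|', @words), "\n";
# We're|done|Mr|O'Conner|sent|e-mail|said|no|thanks
```

Note that the period after "Mr" is still dropped; this pattern does nothing for abbreviations.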

But even with that, some people agonize over the different uses of hyphens; most would agree that something like "re-edit" should be one "word", but what about "kick-in-the-pants hot sauce" -- how many words there?
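
For what it's worth, the pattern as written allows only one hyphen per word. If you decide that a whole hyphenated chain should count as a single word, one option (an assumption about what you want, not a rule) is to change the "?" to "*". A sketch comparing the two:

```perl
use strict;
use warnings;

my $text = "kick-in-the-pants hot sauce";

# With "?" only one hyphenated continuation is allowed per match.
my @one_hyphen = $text =~ /(\w+(?:[-']\w+)?)/g;

# With "*" any number of hyphen/apostrophe continuations are chained.
my @any_hyphens = $text =~ /(\w+(?:[-']\w+)*)/g;

print join('|', @one_hyphen), "\n";   # kick-in|the-pants|hot|sauce
print join('|', @any_hyphens), "\n";  # kick-in-the-pants|hot|sauce
```

Which of those two answers is "right" depends entirely on what you plan to do with the words.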

Oh, and if you don't want to match "words" containing digits or underscores, just replace "\w" with [a-z] and add an "i" modifier at the end (next to the "g"), to make it case-insensitive.
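
A sketch of that letters-only variant, using a made-up test string:

```perl
use strict;
use warnings;

my $text = "Route_66 has 2 lanes, said O'Conner";

# [a-z] with /i matches ASCII letters only, so digits and
# underscores now act as separators instead of word characters.
my @words = $text =~ /([a-z]+(?:[-'][a-z]+)?)/gi;

print join('|', @words), "\n";
# Route|has|lanes|said|O'Conner
```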

(update: And then there's the problem of accented characters... like é etc; if you have to go there, check out the perlunicode man page -- make sure your text is in utf8, and use \p{Letter} in place of "\w" or [a-z].)
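
A minimal sketch of the \p{Letter} version, assuming the script itself is saved as UTF-8 (hence "use utf8") and that output should be UTF-8 as well; the sample text is made up:

```perl
use strict;
use warnings;
use utf8;                              # this source file is assumed to be saved as UTF-8
binmode(STDOUT, ':encoding(UTF-8)');   # encode characters on output

my $text = "Café déjà-vu, naïve";

# \p{Letter} matches any Unicode letter, accented or not,
# but neither digits nor underscores.
my @words = $text =~ /(\p{Letter}+(?:[-']\p{Letter}+)?)/g;

print join('|', @words), "\n";
# Café|déjà-vu|naïve
```

Text read from a file or STDIN would need to be decoded first (for example by opening the handle with an ':encoding(UTF-8)' layer), otherwise the regex sees raw bytes rather than characters.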

In reply to Re: Match non-whitespace characters by graff
in thread Match non-whitespace characters by biofeng918
