
One thing that bothers me is that you use different criteria for splitting into words and for determining the length. In the one case, you are using whatever you get after splitting on certain punctuation; in the other, you are checking only the 26 basic letters.

Before you write the code, you should decide what you want to count as a "word". This is actually a fairly complex task.

I would approach it as something like:

A "word" is a contiguous sequence of letters or digits with zero or more hyphens or single quotes in the inside. (There are words that begin or end with single quotes, but it isn't possible to easily distinguish those from quoted words.)

With a definition like that, you can pick them out more easily with something like @words = m/((?:\w|\b[-']\b){4,})/g than by splitting on non-word characters.
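As a rough illustration of that pattern, here is a minimal sketch. The sample text in $big_string is made up for the example and is not from the original thread.

```perl
use strict;
use warnings;

# Hypothetical sample text, invented for this example.
my $big_string = "The well-known author's tale -- isn't it 'great'? foo_bar";

# Runs of \w characters, allowing a hyphen or single quote only where it
# sits between two word characters; {4,} keeps runs of four or more.
my @words = $big_string =~ m/((?:\w|\b[-']\b){4,})/g;

print "$_\n" for @words;
```

With this input, "The" and "it" are too short to match, the bare "--" is skipped, and the quotes around 'great' are dropped. Note that foo_bar comes through as one word, because \w matches the underscore.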

One problem with that is that \w will include _. To avoid that, use [^\W_] instead of \w and (?<=[^\W_])[-'](?=[^\W_]) instead of the other part (untested).
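Something like the following sketch tries out that stricter pattern. The names $alnum and $big_string and the sample text are assumptions made for the example, so treat it as a starting point rather than a vetted solution.

```perl
use strict;
use warnings;

# Hypothetical sample text, invented for this example.
my $big_string = "snake_case_name well-known don't x-ray _private";

# [^\W_] is "a word character that is not an underscore".
my $alnum = qr/[^\W_]/;

# A hyphen or single quote counts only when it has such a character
# directly before it and directly after it.
my @words = $big_string =~ m/((?:$alnum|(?<=$alnum)[-'](?=$alnum)){4,})/g;

print "$_\n" for @words;
```

Here the underscores act as separators, so snake_case_name yields three separate words, while well-known, don't and x-ray stay whole.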

Another problem is accented characters. If you want to include them, either utf8::decode($big_string) before using it (assuming the input is UTF-8 bytes, so that Perl treats it as characters) or use locale.
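A small sketch of the difference, assuming the input arrives as UTF-8 encoded bytes (the sample bytes below are invented for the example). Turning the bytes into a character string with utf8::decode lets \w see the accented letters; on the raw bytes, without locale or the unicode_strings feature, it does not.

```perl
use strict;
use warnings;

# Hypothetical input: the UTF-8 bytes for "naïve café résumé".
my $big_string = "na\xc3\xafve caf\xc3\xa9 r\xc3\xa9sum\xc3\xa9";

# On the undecoded bytes, the high bytes are not word characters here,
# so every run of letters is shorter than four.
my @before = $big_string =~ m/([^\W_]{4,})/g;

# Decode in place: the string now holds characters, not bytes.
utf8::decode($big_string);
my @after = $big_string =~ m/([^\W_]{4,})/g;

binmode STDOUT, ':encoding(UTF-8)';
print "before decode: ", scalar(@before), " words\n";
print "after decode:  $_\n" for @after;
```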

And if your input has words split over lines with a hyphen, you may want to s/-\n//g your input.
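A quick sketch of that substitution on made-up input, working on a copy so the original string stays available for comparison. The variable names and sample text are assumptions for the example.

```perl
use strict;
use warnings;

# Hypothetical input with two words broken across lines.
my $big_string = "The docu-\nmentation is well-\nknown.\n";

# Copy the string, then remove every hyphen that is immediately
# followed by a newline, joining the two halves.
(my $joined = $big_string) =~ s/-\n//g;

print $joined;
```

One thing to watch for: a genuinely hyphenated word that happens to break at its hyphen is joined as well, so "well-" followed by "known" becomes "wellknown" here.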

If you have test input, you really ought to print out what you are getting as words and compare it to the input to see what special cases you may be missing.
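One possible way to do that check is to print each extracted word in brackets and then show what is left of the input once the matched words are removed. The sketch below uses the simpler \w-based pattern; $word_re, $leftover and the sample text are names and data assumed for the example.

```perl
use strict;
use warnings;

# Hypothetical test input, invented for this example.
my $big_string = "Rock 'n' roll isn't dead, it's co-operative.";

my $word_re = qr/(?:\w|\b[-']\b){4,}/;
my @words   = $big_string =~ /($word_re)/g;

# Show exactly what was picked out as a word.
print "[$_]\n" for @words;

# Delete the matched words from a copy and squeeze the whitespace,
# leaving only the pieces the pattern did not accept.
(my $leftover = $big_string) =~ s/$word_re//g;
$leftover =~ s/\s+/ /g;
print "not matched:$leftover\n";
```

Anything that looks like a word in the "not matched" line, such as 'n' here, is a special case the definition does not yet cover.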

In reply to Re: Tutelage, part two by ysth
in thread Tutelage, part two by ctp
