Re: Tutelage, part two

One thing that bothers me is that you use different criteria for splitting into words and for determining the length. In the one case, you are using whatever you get after splitting on certain punctuation; in the other, you are counting only the 26 basic letters.

Before you write the code, you should decide what you consider a "word". This is actually a fairly complex question.

I would approach it as something like:

A "word" is a contiguous sequence of letters or digits with zero or more hyphens or single quotes in its interior. (There are words that begin or end with single quotes, but it isn't possible to easily distinguish those from quoted words.)

With a definition like that, you can pick them out more easily with something like @words = m/((?:\w|\b[-']\b){4,})/g than by splitting on non-word characters.
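A minimal sketch of that extraction, wrapped in a helper so it can be tried on a string. The name extract_words and the sample sentence are made up for illustration; the pattern is the one given above, including its {4,} minimum length.

```perl
use strict;
use warnings;

# Hypothetical helper: return every run of four or more "word units",
# where a unit is a \w character, or a hyphen/single quote that sits
# between two \w characters (that is what the two \b assertions enforce).
sub extract_words {
    my ($text) = @_;
    return $text =~ m/((?:\w|\b[-']\b){4,})/g;
}

my @words = extract_words("The well-known cat said 'don't panic' twice.");
print join("|", @words), "\n";    # well-known|said|don't|panic|twice
```

Note that the quotes around 'don't panic' are dropped while the apostrophe inside don't is kept, and that "The" and "cat" fall below the length limit.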

One problem with that is that \w will include _. To avoid that, use [^\W_] instead of \w and (?<=[^\W_])[-'](?=[^\W_]) instead of the other part (untested).
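Here is a rough sketch of the underscore-free variant, so the "untested" part can at least be poked at. The helper name and the sample string are invented; the two sub-patterns are the ones suggested above.

```perl
use strict;
use warnings;

my $alnum = qr/[^\W_]/;                      # a \w character that is not "_"
my $inner = qr/(?<=[^\W_])[-'](?=[^\W_])/;   # hyphen or quote between two of them

# Hypothetical helper: same idea as the \w version, but "_" ends a word.
sub extract_words_no_underscore {
    my ($text) = @_;
    return $text =~ m/((?:$alnum|$inner){4,})/g;
}

my @words = extract_words_no_underscore("snake_case rock-n-roll it's foo_bar");
print join("|", @words), "\n";    # snake|case|rock-n-roll|it's
```

With this version snake_case is split at the underscore, and foo_bar disappears entirely because neither half reaches four characters.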

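Another problem is accented characters. If you want to include them, make sure \w sees them as letters: either turn the string into a character string before matching (for example utf8::upgrade($big_string) for Latin-1 data, or utf8::decode($big_string) if the data is UTF-8 bytes) or use locale. Note that utf8::encode goes the wrong way here, since it turns characters into bytes.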
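A small check of why the form of the string matters, assuming the input arrives as raw UTF-8 bytes (the sample string is made up, and your data may well be Latin-1 instead). It compares \w against the same text as bytes and as decoded characters, using utf8::decode for the conversion.

```perl
use strict;
use warnings;

# "cafe" with an acute accent on the e, written as raw UTF-8 bytes
# (assumed input form for this sketch).
my $bytes = "caf\xc3\xa9";
my $as_bytes = ($bytes =~ /^\w+$/) ? 1 : 0;    # the two high bytes are not \w here

my $chars = $bytes;
utf8::decode($chars);                          # bytes -> a 4-character string
my $as_chars = ($chars =~ /^\w+$/) ? 1 : 0;    # the accented letter is now \w

print "bytes: $as_bytes, characters: $as_chars\n";
```

This is run without use locale and without the unicode_strings feature; either of those changes how the undecoded string is treated.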

And if your input has words split over lines with a hyphen, you may want to s/-\n//g your input.
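As a quick illustration of that substitution (the helper name and sample text are made up), applied to the whole input held in one string:

```perl
use strict;
use warnings;

# Hypothetical helper: rejoin words that were broken across lines with
# a trailing hyphen. It needs the whole text in one string, because a
# line-by-line loop never sees the hyphen and the newline together.
sub join_hyphenated_lines {
    my ($text) = @_;
    $text =~ s/-\n//g;
    return $text;
}

print join_hyphenated_lines("a quite extra-\nordinary result\n");
# a quite extraordinary result
```

One thing to watch: a genuinely hyphenated word that happens to break at its hyphen (well-\nknown) comes out as wellknown, so this is only a heuristic.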

If you have test input, you really ought to print out what you are getting as words and compare it to the input to see what special cases you may be missing.
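Something along these lines would do for the eyeballing; it is only a sketch, with an invented helper name and sample sentence, reusing the \w-based pattern from earlier in this post. It lists each distinct word once, sorted, so the output is short enough to compare against the input by hand.

```perl
use strict;
use warnings;

# Hypothetical helper: the distinct words found in $text, sorted,
# so odd-looking entries stand out when you read down the list.
sub unique_words {
    my ($text) = @_;
    my %seen;
    $seen{$_}++ for $text =~ m/((?:\w|\b[-']\b){4,})/g;
    return sort keys %seen;
}

print "$_\n" for unique_words("Don't panic, don't panic: it's only test input.");
```

The default sort is case-sensitive, so Don't and don't show up as separate entries; whether they should count as the same word is one of the special cases worth deciding on.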


Replies are listed 'Best First'.
Re: Re: Tutelage, part two by ctp (Beadle) on Jan 05, 2004 at 04:23 UTC
All great suggestions - thanks!

"If you have test input, you really ought to print out what you are getting as words and compare it to the input to see what special cases you may be missing."

I do indeed have test input that I have become quite intimate with while testing this script. This is how I discovered that nnn'n words were being skipped. I am digging thru the sample file, and I haven't found anything not getting picked up yet... but of course I probably have a lucky set of words in my file (a random text file from work).
