in reply to sentence-safe chop heuristics?
Let's start with some caveats:
First, ++ to thundergnat's warning re the complexities and to the suggestions from Fletch and Not_a_Number.
Second, what's below doesn't address any utf8 issues, as those are well covered elsewhere. It also assumes (which may be a mistake, since you do mention "...just cutting at a certain number of bytes.") that your use of "chop" in the title and body of the OP means "separate" rather than Perl's chop. And it applies a non-grammarian's definition of a sentence: no verb is required; a sentence is any run of letters, digits, and punctuation that ends with a period followed by one or more whitespace characters.
If that's ok,
    #!C:/perl/bin -w
    # sentence.pl
    use strict;
    use vars qw( $sentence @sentence );

    while ( <DATA> ) {
        print "\n<$_>\n\n";                    # echo the raw input line
        @sentence = split /\.{1,4}\s+/, $_;    # the split consumes the period(s) and whitespace
        # push @sentences, $first;
        for $sentence ( @sentence ) {
            print "\t-$sentence.-\n";          # so put a single period back
        }
    }
    __DATA__
    This is a sentence. This is another.  There were two spaces before this sentence.
    This is the first and only sentence of a new paragraph... but it's a long one.
    This sentence has commas, thusly, and, for good measure -- dashes -- thusly.
    End of test data.
produces output which looks like this (leaving out the echoed <...> input lines):
    -This is a sentence.-
    -This is another.-
    -There were two spaces before this sentence.-
    -This is the first and only sentence of a new paragraph.-
    -but it's a long one.-
    -This sentence has commas, thusly, and, for good measure -- dashes -- thusly.-
    -End of test data.-
So, except for the ellipsis (oops!) and absent the possible inclusion of html entities, this seems to me to answer your requirement.
But once you add the possibility of html entities or tags (did you intend to include tags, possibly with arguments?), the separation becomes far more complicated.
For a very simple example, you might be dealing with text like this:
This is a sentence. &nbsp;And this is another.
While this might appear at the end of a paragraph.</p>
<p>And so on.</p>
And now the output does not satisfy your needs:
    -This is a sentence.-
    -This is another.-
    -There were two spaces before this sentence.-
    -This is the first and only sentence of a new paragraph.-
    -but it's a long one.-
    -This sentence has commas, thusly, and, for good measure -- dashes -- thusly.-
    -End of simple test data.-
    -Begin simple data with html entities.<br>
    .-
    -This is a sentence separated from the following sentence by two spaces, one of them an html entity.-
    - And this is another.-
    -While this might appear at the end of a paragraph.</p>
    .-
    -<p>And so on.</p>
    .-
This could, of course, be fixed by using an appropriate module to convert the entities and remove the tags; perhaps HTML::TokeParser::Simple. You could also solve the ellipsis problem with a slightly more complex regex (hint: look for {1,3} periods followed by \s+ followed by a single upper-case letter) or, better yet, look at the regex in thundergnat's post (above).
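To make the hint concrete, here's a minimal sketch of an ellipsis-tolerant split. It's a variation on the hint rather than a literal transcription: it uses a lookbehind so the periods are kept instead of consumed, and a lookahead for the upper-case letter. The sub name split_sentences and the sample line are made up for illustration. It will still split wrongly after things like "Mr." followed by a capitalized name, so treat it as a starting point only.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): split one line into sentences.
# A boundary is whitespace that follows a period AND precedes an upper-case
# letter, so "paragraph... but" is left in one piece.
sub split_sentences {
    my ($text) = @_;
    $text =~ s/^\s+|\s+$//g;    # trim, so the last sentence carries no newline
    return split /(?<=\.)\s+(?=[A-Z])/, $text;
}

my $line = "This is a sentence. This is another.  Two spaces came first... "
         . "but it's a long one. End of test data.\n";

print "\t-$_-\n" for split_sentences($line);
```

Because nothing but the whitespace is consumed, each piece keeps its own terminator, ellipsis included, and there's no need to glue a period back on when printing.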
At that point, however, you'll have to decide whether such code would satisfy your requirement for a "light-weight implementation."
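To help with that decision, here's roughly the smallest pre-cleaning step I can think of. It is emphatically NOT HTML::TokeParser::Simple or any other real parser: crude_clean and the %entity table are hypothetical names of my own, it knows only five entities, and it assumes tags never contain a literal ">" inside an attribute. It's meant only to show the shape of "strip the tags, convert the entities, then split".

```perl
use strict;
use warnings;

# Assumption: only these few named entities matter; a real module knows them all.
my %entity = ( nbsp => ' ', amp => '&', lt => '<', gt => '>', quot => '"' );

# Hypothetical helper: crude stand-in for a proper HTML parser.
sub crude_clean {
    my ($html) = @_;
    $html =~ s/<[^>]*>/ /g;    # replace each tag with a space
    # Convert known entities in a single pass; leave unknown ones untouched.
    $html =~ s/&(\w+);/exists $entity{$1} ? $entity{$1} : "&$1;"/ge;
    $html =~ s/\s+/ /g;        # collapse runs of whitespace (newlines too)
    $html =~ s/^ | $//g;       # trim the ends
    return $html;
}

my $html = "This is a sentence. &nbsp;And this is another.\n"
         . "While this might appear at the end of a paragraph.</p>\n"
         . "<p>And so on.</p>\n";

# Same split as the simple script; it eats the period, so put one back.
print "\t-$_.-\n" for split /\.{1,4}\s+/, crude_clean($html) . "\n";
```

Tags are stripped before entities are converted so that a decoded &lt; can't be mistaken for the start of a tag. If even this much feels too heavy, or too fragile, that probably answers the "light-weight" question.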