in reply to sentence-safe chop heuristics?

Let's start with some caveats:

First, ++ to thundergnat's warning re the complexities and to the suggestions from Fletch and Not_a_Number.

Second, what's below doesn't address any UTF-8 issues, as those are well covered elsewhere. It also assumes (perhaps mistakenly, since you do mention "...just cutting at a certain number of bytes.") that "chop" in the title and body of the OP means "separate" rather than Perl's chop function. Finally, it applies a "non-grammarian" definition of a sentence: no verb is required; a sentence is simply any run of letters, numbers, and punctuation that ends with a period followed by one or more whitespace characters.

If that's ok,

    #!C:/perl/bin -w
    # sentence.pl
    use strict;
    use vars qw( $sentence @sentence );

    while ( <DATA> ) {
        print "\n<$_>\n\n";
        @sentence = split /\.{1,4}\s+/, $_;
        # push @sentences, $first;
        for $sentence ( @sentence ) {
            print "\t-$sentence-\n";
        }
    }
    __DATA__
    This is a sentence. This is another.  There were two spaces before this sentence.
    This is the first and only sentence of a new paragraph... but it's a long one.
    This sentence has commas, thusly, and, for good measure -- dashes -- thusly.
    End of test data.

produces output which looks like this:

    -This is a sentence.-
    -This is another.-
    -There were two spaces before this sentence.-
    -This is the first and only sentence of a new paragraph.-
    -but it's a long one.-
    -This sentence has commas, thusly, and, for good measure -- dashes -- thusly.-
    -End of test data.-

So, except for the ellipsis (ooops!), and absent the possible inclusion of HTML entities, this seems to me to answer your requirement.

But once you add the possibility of HTML entities or tags (did you intend to include tags, possibly with attributes?), the separation becomes far more complicated.

For a very simple example, you might be dealing with text like this:

This is a sentence.  And this is another.
While this might appear at the end of a paragraph.</p>
<p>And so on.</p>

And now the output does not satisfy your needs:

    -This is a sentence.-
    -This is another.-
    -There were two spaces before this sentence.-
    -This is the first and only sentence of a new paragraph.-
    -but it's a long one.-
    -This sentence has commas, thusly, and, for good measure -- dashes -- thusly.-
    -End of simple test data.-
    -Begin simple data with html entities.<br> .-
    -This is a sentence separated from the following sentence by two spaces, one of them an html entity.-
    -&nbsp;And this is another.-
    -While this might appear at the end of a paragraph.&lt;/p&gt; .-
    -&lt;p&gt;And so on.&lt;/p&gt; .-

This could, of course, be fixed by using an appropriate module to convert the entities and remove the tags; perhaps HTML::TokeParser::Simple. You could also solve the ellipsis problem with a slightly more complex regex (hint: look for {1,3} periods followed by \s+ followed by a single upper-case letter) or, better yet, look at the regex in the post (above) by thundergnat.
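The hint can be sketched roughly as follows. This is only an illustration, not thundergnat's regex: the name split_on_capital is made up here, it assumes plain ASCII text with any entities and tags already dealt with, and the lookbehind only checks that at least one period precedes the whitespace (which covers one to three periods, and more).

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): break after a period only when
# whitespace and then an upper-case letter follow it.
sub split_on_capital {
    my ($text) = @_;
    $text =~ s/\s+\z//;    # drop the trailing newline, if any
    # The lookbehind keeps the period(s) with the sentence; the lookahead
    # leaves the capital letter at the start of the next sentence.
    return split /(?<=\.)\s+(?=[A-Z])/, $text;
}

my $line = "This is the first and only sentence of a new paragraph... but it's a long one. End of test data.\n";
print "\t-$_-\n" for split_on_capital($line);
```

The ellipsis line stays in one piece because "but" starts with a lower-case letter. The sketch still splits after an abbreviation or initial that happens to be followed by a capital, and it ignores sentences ending in "?" or "!".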

At that point, however, you'll have to decide whether such code would satisfy your requirement for a "light-weight implementation."

Re^2: sentence-safe chop heuristics?
by Grundle (Scribe) on Apr 18, 2007 at 21:54 UTC
    That simple little algorithm is hardly even close to a reasonable solution. It ignores too many of the special cases involved in what is known as a "sentence" in the English language. Solving this using a \s+ followed by a single upper-case letter is wrong wrong wrong! For a fast fix to your problem I would suggest using the Lingua::EN::Sentence module. It has most cases covered, but you would be amazed at how often it can fail. For small sets of data it should be more than adequate. One of the best ways is to write a statistical parser using Bayes' theorem to "guess" whether the end of a sentence has been reached. The downside to this method is that you have to make a "training set" so that it can build a statistical model to work on. The previous algorithm, for the following input
    This is a test. Am I testing this right? What if a proper name like John A. Smith is entered? Wow that is crazy! On Apr. 18 I ran this to see if it worked. What if I try A vs. B or a vs. b? Is it going to work? What if I talk about the U.S.S.R. or the U.S.A.? "I like to speak like this. It makes me laugh." said the funny man.
    Will output
    -This is a test-
    -Am I testing this right? What if a proper name like John A-
    -Smith is entered? Wow that is crazy! On Apr-
    -18 I ran this to see if it worked-
    -What if I try A vs-
    -B or a vs-
    -b? Is it going to work? What if I talk about the U.S.S.R-
    -or the U.S.A.? "I like to speak like this-
    -It makes me laugh." said the funny man.-
    Notice how often it fails for "simple" sentences...
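
To give a feel for how quickly the special cases pile up, here is a rough core-Perl sketch that skips a few of the traps in that input. It is not how Lingua::EN::Sentence works and it is not a statistical parser; the sub name split_with_abbrevs and the tiny abbreviation list are invented for illustration only.

```perl
use strict;
use warnings;

# Hypothetical, deliberately tiny abbreviation list; a real one would be far longer.
my %ABBREV = map { $_ => 1 } qw( Apr vs Mr Mrs Dr );

sub split_with_abbrevs {
    my ($text) = @_;
    my ( @sentences, @words );
    for my $token ( split ' ', $text ) {
        push @words, $token;
        # Only tokens ending in . ! or ? (optionally followed by a closing
        # quote or parenthesis) can end a sentence.
        next unless $token =~ /^(.*?)([.!?]+)["')]?$/;
        my ( $word, $end ) = ( $1, $2 );
        if ( $end eq '.' ) {
            next if $ABBREV{$word};                        # "Apr.", "vs."
            next if $word =~ /^[A-Z]$/;                    # an initial: "A."
            next if $word =~ /^(?:[A-Za-z]\.)+[A-Za-z]$/;  # "U.S.S.R."
        }
        push @sentences, join ' ', @words;
        @words = ();
    }
    push @sentences, join ' ', @words if @words;    # leftover text, if any
    return @sentences;
}

print "-$_-\n" for split_with_abbrevs(
    'On Apr. 18 I ran this to see if it worked. What if I try A vs. B or a vs. b?'
);
```

Even this still gets things wrong: it never ends a sentence on a dotted acronym such as "U.S.S.R.", it breaks inside quoted speech, and it collapses runs of whitespace into single spaces.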
      Many excellent points, Grundle; I could almost say, "a grundle of excellent examples of cases where my preceding post fails horribly."

      But perhaps I did not make my point sufficiently plain: the OP's requirements are unlikely to be met by any "lightweight" approach or simple algorithm. Either will tend to produce simple-minded output.

      As Not_a_Number noted high up in this thread, Lingua::EN::Sentence may be a better choice (your added note regarding training is likely to be helpful to the OP), but unless I've missed something there (certainly possible, as I've only scanned it quickly), dealing with HTML entities is going to take a lot of extending.

        Yes, you are absolutely correct! When dealing with HTML entities this process should be done in two steps.

        Step 1: Data extraction - Use an HTML Parser to pull out all of the data first, so that it can be represented in a humanly readable format.

        Step 2: Sentence extraction - Use your sentence parser to break the humanly readable information up into separate sentences.
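
A minimal sketch of the two steps, using only core Perl so it runs anywhere. The regex-based extract_text is just a stand-in for a real HTML parser (it will mishandle comments, scripts, attributes containing ">", and most entities), and both sub names and the small entity table are assumptions made for this example.

```perl
use strict;
use warnings;

# Step 1 stand-in: a real HTML parser module should do this job.
# Hypothetical, incomplete entity table for illustration only.
my %ENTITY = ( nbsp => ' ', amp => '&', lt => '<', gt => '>', quot => '"' );

sub extract_text {
    my ($html) = @_;
    $html =~ s/<[^>]*>/ /g;    # replace each tag with a space
    # Decode known entities in a single pass; leave unknown ones untouched.
    $html =~ s/&(\w+);/exists $ENTITY{$1} ? $ENTITY{$1} : "&$1;"/ge;
    $html =~ s/\s+/ /g;        # collapse runs of whitespace
    $html =~ s/^ | \z//g;      # trim the ends
    return $html;
}

# Step 2: naive sentence split on terminator + whitespace + capital letter.
sub split_sentences {
    my ($text) = @_;
    return split /(?<=[.!?])\s+(?=[A-Z])/, $text;
}

my $html = "This is a sentence.&nbsp; And this is another.\n"
         . "While this might appear at the end of a paragraph.</p>\n"
         . "<p>And so on.</p>\n";
print "-$_-\n" for split_sentences( extract_text($html) );
```

Tags are stripped before entities are decoded so that an encoded "&lt;p&gt;" in the text is not mistaken for markup.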

        The problem is exacerbated further when you also have to consider different tagging formats such as XML and its many variants, or an SGML standard, etc., ad nauseam.

        Here is another thought I had recently. Would it be possible to write a grammar and use Parse::RecDescent to pull out sentences? I really haven't investigated it thoroughly yet, but I thought it might be an interesting exercise.