foomatic99 has asked for the wisdom of the Perl Monks concerning the following question:

I am looking for a way to chop text at sentence boundaries. I realize that somebody out there must have come up with some heuristics for doing this, though I can't think of any unambiguous terms to search for something like this.

I realize that nothing in a reasonably light-weight implementation is going to get it right 100% of the time, but at least I should be able to find something better than just cutting at a certain number of bytes.

The text is English, utf8... possibly with HTML entity references.

Can anybody help me out here?

Replies are listed 'Best First'.
Re: sentence-safe chop heuristics?
by Not_a_Number (Prior) on Apr 18, 2007 at 19:24 UTC

    Lingua::EN::Sentence looks to be just what you need.

    From the docs:

    The Lingua::EN::Sentence module contains the function get_sentences, which splits text into its constituent sentences, based on a regular expression and a list of abbreviations (built in and given).

    Certain well-known exceptions, such as abbreviations, may cause incorrect segmentations. But some of them are already integrated into this code and are being taken care of. Still, if you see that there are words causing the get_sentences() to fail, you can add those to the module, so it notices them.
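A minimal sketch of how the module might be called, assuming Lingua::EN::Sentence is installed from CPAN (it is not a core module). `get_sentences` and `add_acronyms` are the module's own functions; `split_sentences` is a hypothetical wrapper name used only here, and the sample text is taken from the original question.

```perl
use strict;
use warnings;

# Hypothetical wrapper for this sketch; not part of the module.
sub split_sentences {
    my ($text, @extra_abbreviations) = @_;

    # Assumption: the module is installed. Load it at run time so that a
    # missing module yields an empty list instead of a compile-time failure.
    eval { require Lingua::EN::Sentence; 1 }
        or return;

    # add_acronyms() extends the module's built-in abbreviation list.
    Lingua::EN::Sentence::add_acronyms(@extra_abbreviations)
        if @extra_abbreviations;

    # get_sentences() returns a reference to an array of sentences.
    my $sentences = Lingua::EN::Sentence::get_sentences($text);
    return $sentences ? @$sentences : ();
}

my @sentences = split_sentences(
    'I am looking for a way to chop text. Can anybody help me out here?',
);
print "[$_]\n" for @sentences;
```

Passing extra abbreviations (for example `split_sentences($text, 'vs', 'approx')`) is the "add those to the module" step the docs describe.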
Re: sentence-safe chop heuristics?
by Fletch (Bishop) on Apr 18, 2007 at 14:58 UTC
Re: sentence-safe chop heuristics?
by thundergnat (Deacon) on Apr 18, 2007 at 19:54 UTC

    That is a very tricky problem to get correct. Natural language is difficult for machines to parse.

    If you can live with something that will have trouble with honorifics and abbreviations, but be mostly accurate, you could use a fairly simple regex.

    use warnings;
    use strict;

    my $data;
    while (my $line = <DATA>){
        chomp $line;
        $data .= " $line";
        $data =~ s/ +/ /g;
        if ($data =~ m/(.+?[\?\.\!]('|")?\s)(?=\p{Upper}|\p{Punct})/){
            my $sentence = $1;
            print $sentence,"\n\n";
            substr $data, 0 ,length $sentence, '';
        }
    }
    print $data if $data;

    __DATA__
    foomatic99 has asked for the wisdom of the Perl Monks concerning the following question:
    I am looking for a way to chop text at sentence boundaries. I realize that somebody out there must have come up with some heuristics for doing this, though I can't think of any unambiguous terms to search for something like this.
    I realize that nothing in a reasonably light-weight implementation is going to get it right 100% of the time, but at least I should be able to find something better than just cutting at a certain number of bytes.
    The text is English, utf8... possibly with HTML entity references.
    Can anybody help me out here?

    There are all kinds of cases where this will fail, but it may be good enough.
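To connect this back to the original goal (cutting at a sentence boundary rather than at a fixed number of bytes), here is a rough sketch that keeps as many whole sentences as fit within a character limit. It reuses the same kind of heuristic as the regex above, so it shares its weaknesses with honorifics and abbreviations. `truncate_at_sentence` is a hypothetical name, and the limit counts characters, not bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: keep as many whole sentences as fit in $max characters.
# Returns an empty string if even the first sentence is too long.
sub truncate_at_sentence {
    my ($text, $max) = @_;
    return $text if length $text <= $max;

    my $kept = '';
    # A sentence ends at . ? or !, optionally followed by a closing quote,
    # then whitespace before an uppercase letter or punctuation mark,
    # or at the end of the text.
    while ($text =~ m/\G(.+?[.?!]['"]?)(?:\s+(?=\p{Upper}|\p{Punct})|\s*\z)/gs) {
        my $separator = $kept eq '' ? '' : ' ';
        last if length($kept) + length($separator) + length($1) > $max;
        $kept .= $separator . $1;
    }
    return $kept;
}

my $text = 'This is one. This is two. And a third one here.';
print truncate_at_sentence($text, 30), "\n";
```

A caller that must always return something could fall back to a plain `substr` when the result is empty.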

Re: sentence-safe chop heuristics?
by ww (Archbishop) on Apr 18, 2007 at 21:14 UTC

    Let's start with some caveats:

    First, ++ to thundergnat's warning re the complexities and to the suggestions from Fletch and Not_a_Number.

    Second, what's below doesn't address any utf8 issues, as those are well covered elsewhere. It also assumes (which may be a mistake, since you do mention "...just cutting at a certain number of bytes.") that your use of "chop" in the title and body of the OP means "separate" rather than Perl's chop, and it applies a "non-grammarian" definition to sentences: they're not required to have a verb, but rather consist of any set of alphas, numbers, and punctuation, and end with a period followed by one or more whitespace characters.

    If that's ok,

    #!C:/perl/bin -w
    # sentence.pl
    use strict;
    use vars qw( $sentence @sentence );

    while ( <DATA> ) {
        print "\n<$_>\n\n";
        @sentence = split /\.{1,4}\s+/, $_;
        # push @sentences, $first;
        for $sentence( @sentence ) {
            print "\t-$sentence-\n";
        }
    }
    __DATA__
    This is a sentence. This is another.  There were two spaces before this sentence.
    This is the first and only sentence of a new paragraph... but it's a long one.
    This sentence has commas, thusly, and, for good measure -- dashes -- thusly.
    End of test data.

    produces output which looks like this:

    -This is a sentence.-
    -This is another.-
    -There were two spaces before this sentence.-
    -This is the first and only sentence of a new paragraph.-
    -but it's a long one.-
    -This sentence has commas, thusly, and, for good measure -- dashes -- thusly.-
    -End of test data.-

    So, except for the ellipsis. (ooops!) and absent the possible inclusion of html entities, this seems to me to answer your requirement.

    But once you add the possibility of html entities or tags (did you intend to include tags, possibly with arguments?), the separation becomes far more complicated.

    For a very simple example, you might be dealing with text like this:

    This is a sentence.  And this is another.
    While this might appear at the end of a paragraph.</p>
    <p>And so on.</p>

    And now the output does not satisfy your needs:

    -This is a sentence.-
    -This is another.-
    -There were two spaces before this sentence.-
    -This is the first and only sentence of a new paragraph.-
    -but it's a long one.-
    -This sentence has commas, thusly, and, for good measure -- dashes -- thusly.-
    -End of simple test data.-
    -Begin simple data with html entities.<br> .-
    -This is a sentence separated from the following sentence by two spaces, one of them an html entity.-
    -&nbsp;And this is another.-
    -While this might appear at the end of a paragraph.&lt;/p&gt; .-
    -&lt;p&gt;And so on.&lt;/p&gt; .-

    This could, of course, be fixed by using an appropriate module to convert the entities and remove the tags; perhaps HTML::TokeParser::Simple. You could also solve the ellipsis problem with a slightly more complex regex (Hint: look for {1,3} periods followed by \s+ followed by a single upper case letter) or, better yet, look at the regex in the post (above) by thundergnat.
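The two fixes hinted at above might be sketched like this. Both helper names are hypothetical. `strip_markup` is only a crude stand-in for a real parser such as HTML::TokeParser::Simple: it drops anything that looks like a tag and decodes just the handful of entities listed in the hash. `split_on_terminal_periods` follows the hint by splitting only where a period is followed by whitespace and an uppercase letter.

```perl
use strict;
use warnings;

# Assumption: only these few entities matter; a real module knows them all.
my %ENTITY = ( nbsp => ' ', amp => '&', lt => '<', gt => '>', quot => '"' );

# Hypothetical helper: remove tags, then decode the known entities.
# Tags go first so that decoded &lt; and &gt; are never mistaken for markup.
sub strip_markup {
    my ($text) = @_;
    $text =~ s/<[^>]+>//g;
    $text =~ s/&(\w+);/exists $ENTITY{$1} ? $ENTITY{$1} : "&$1;"/ge;
    return $text;
}

# Hypothetical helper: split on whitespace that follows a period and precedes
# an uppercase letter. Looking behind for the last period covers one to three
# of them, and the lookarounds keep the periods attached to the sentence.
sub split_on_terminal_periods {
    my ($text) = @_;
    return split /(?<=\.)\s+(?=\p{Upper})/, $text;
}

my $clean = strip_markup('While this might appear at the end of a paragraph.</p> <p>And so on.</p>');
print "-$_-\n" for split_on_terminal_periods($clean);
```

An ellipsis followed by a lowercase word ("paragraph... but it's a long one.") is no longer treated as a boundary, though abbreviations followed by a capital letter still are.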

    At that point, however, you'll have to decide whether such code would satisfy your requirement for a "light-weight implementation."

      That simple little algorithm is hardly even close to a reasonable solution. It ignores too many of the cases that arise when dealing with what is known as a "sentence" in the English language, and there are many special cases involved. Solving this using a \s+ followed by a single upper case letter is wrong wrong wrong!

      For a fast fix to your problem I would suggest using the Lingua::EN::Sentence module. It has most cases covered, but you would be amazed at how often it can fail. For small sets of data it should be more than adequate.

      One of the best ways is to write a statistical parser using Bayes' theorem to "guess" whether the end of a sentence has been reached. The downside to this method is that you have to make a "training set" so that it can build a statistical model to work on.

      The previous algorithm, for the following input
      This is a test. Am I testing this right? What if a proper name like John A. Smith is entered? Wow that is crazy! On Apr. 18 I ran this to see if it worked. What if I try A vs. B or a vs. b? Is it going to work? What if I talk about the U.S.S.R. or the U.S.A.? "I like to speak like this. It makes me laugh." said the funny man.
      Will output
      -This is a test-
      -Am I testing this right? What if a proper name like John A-
      -Smith is entered? Wow that is crazy! On Apr-
      -18 I ran this to see if it worked-
      -What if I try A vs-
      -B or a vs-
      -b? Is it going to work? What if I talk about the U.S.S.R-
      -or the U.S.A.? "I like to speak like this-
      -It makes me laugh." said the funny man.-
      Notice how often it fails for "simple" sentences...
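The statistical idea mentioned above could look roughly like the following naive Bayes sketch. Everything in it is an assumption made for illustration: the training set is a made-up toy, the two features (is the word before the period short, and does the next word start with an uppercase letter) are arbitrary choices, and the function names are hypothetical. A usable model would need a much larger hand-labelled training set and better features.

```perl
use strict;
use warnings;

# Toy training set (invented for illustration). Each entry is
# [ word before the period, next word starts uppercase?, is sentence end? ].
my @TRAINING = (
    [ 'test', 1, 1 ], [ 'worked', 1, 1 ], [ 'crazy', 1, 1 ],
    [ 'right', 1, 1 ], [ 'this',  1, 1 ],
    [ 'apr',  0, 0 ], [ 'vs', 1, 0 ], [ 'vs', 0, 0 ],
    [ 'a',    1, 0 ], [ 'mr', 1, 0 ],
);

# Reduce a candidate boundary to two binary features.
sub features {
    my ($word, $next_is_upper) = @_;
    return (
        short => ( length($word) <= 3 ? 1 : 0 ),
        upper => ( $next_is_upper ? 1 : 0 ),
    );
}

# Count how often each feature value occurs in each class.
sub train {
    my (@examples) = @_;
    my %model = ( total => 0 );
    for my $example (@examples) {
        my ($word, $upper, $is_end) = @$example;
        my %f = features($word, $upper);
        $model{total}++;
        $model{class}{$is_end}++;
        $model{count}{$is_end}{$_}{ $f{$_} }++ for keys %f;
    }
    return \%model;
}

# Bayes' theorem with the naive independence assumption and add-one
# smoothing: score(class) = P(class) * product of P(feature value | class).
sub is_sentence_end {
    my ($model, $word, $next_is_upper) = @_;
    my %f = features($word, $next_is_upper);
    my %score;
    for my $class (0, 1) {
        my $n = $model->{class}{$class} || 0;
        my $p = ( $n + 1 ) / ( $model->{total} + 2 );
        for my $name (keys %f) {
            my $seen = $model->{count}{$class}{$name}{ $f{$name} } || 0;
            $p *= ( $seen + 1 ) / ( $n + 2 );
        }
        $score{$class} = $p;
    }
    return $score{1} > $score{0} ? 1 : 0;
}

my $model = train(@TRAINING);
for my $case ( [ 'test', 1 ], [ 'apr', 0 ], [ 'a', 1 ] ) {
    my $verdict = is_sentence_end($model, @$case) ? 'end' : 'not an end';
    print "'$case->[0].' is $verdict\n";
}
```

With this toy data the period after "test" is classed as a sentence end, while the ones after "Apr" and the initial "A" are not; the verdicts depend entirely on what the training set contains.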
        Many excellent points, Grundle; I could almost say, "a grundle of excellent examples of cases where my preceding post fails horribly."

        But -- perhaps my point was not made sufficiently blatant: the OP's requirements are unlikely to be met by any "lightweight" approach or simple algorithm. Either will tend to produce simple-minded output.

        As Not_a_Number noted high up in this thread, Lingua::EN::Sentence may be a better choice (your added note regarding training is likely to be helpful to the OP), but unless I've missed something there (certainly possible, as I've only scanned it quickly), dealing with html entities is going to take a lot of extending.