in reply to Re: sentence-safe chop heuristics?
in thread sentence-safe chop heuristics?

That simple little algorithm is not even close to a reasonable solution. It ignores too many of the special cases that make up what is known as a "sentence" in the English language. Solving this using a \s+ followed by a single upper case letter is wrong wrong wrong!

For a fast fix to your problem I would suggest using the Lingua::EN::Sentence module. It has most cases covered, but you would be amazed at how often it can still fail. For small sets of data it should be more than adequate.

One of the best ways is to write a statistical parser that uses Bayes' theorem to "guess" whether the end of a sentence has been reached. The downside to this method is that you have to make a "training set" so that it can build a statistical model to work from.

The previous algorithm, for the following input
This is a test. Am I testing this right? What if a proper name like John A. Smith is entered? Wow that is crazy! On Apr. 18 I ran this to see if it worked. What if I try A vs. B or a vs. b? Is it going to work? What if I talk about the U.S.S.R. or the U.S.A.? "I like to speak like this. It makes me laugh." said the funny man.
Will output
-This is a test- -Am I testing this right? What if a proper name like John A- -Smith is entered? Wow that is crazy! On Apr- -18 I ran this to see if it worked- -What if I try A vs- -B or a vs- -b? Is it going to work? What if I talk about the U.S.S.R- -or the U.S.A.? "I like to speak like this- -It makes me laugh." said the funny man.-
Notice how often it fails for "simple" sentences...
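
To make the statistical idea a little more concrete, here is a minimal naive Bayes sketch (in Python, only to keep it short and self-contained). It is not what Lingua::EN::Sentence does. The three features, the class name BoundaryClassifier and the tiny hand-labelled TRAINING list are all assumptions made up for illustration; a usable model would need far more training data and better features.

```python
import math
from collections import defaultdict

def features(prev_word, next_word):
    """Describe one candidate boundary: a '.' between prev_word and next_word."""
    return (
        "prev_short" if len(prev_word) <= 3 else "prev_long",
        "prev_cap" if prev_word[:1].isupper() else "prev_lower",
        "next_cap" if next_word[:1].isupper() else "next_lower",
    )

class BoundaryClassifier:
    """Hypothetical naive Bayes model: is this '.' the end of a sentence?"""

    def __init__(self):
        self.label_counts = defaultdict(int)    # label -> count
        self.feature_counts = defaultdict(int)  # (label, feature) -> count

    def train(self, examples):
        for prev_word, next_word, is_end in examples:
            self.label_counts[is_end] += 1
            for feat in features(prev_word, next_word):
                self.feature_counts[(is_end, feat)] += 1

    def log_score(self, label, feats):
        total = sum(self.label_counts.values())
        # Add-one smoothing everywhere; every feature slot has two values.
        score = math.log((self.label_counts[label] + 1) / (total + 2))
        for feat in feats:
            seen = self.feature_counts[(label, feat)]
            score += math.log((seen + 1) / (self.label_counts[label] + 2))
        return score

    def is_boundary(self, prev_word, next_word):
        feats = features(prev_word, next_word)
        return self.log_score(True, feats) > self.log_score(False, feats)

# Hand-labelled training set (an assumption), taken from the sample input:
# (word before the '.', word after it, is it really a sentence end?)
TRAINING = [
    ("test", "Am", True),
    ("worked", "What", True),
    ("this", "It", True),
    ("A", "Smith", False),
    ("Apr", "18", False),
    ("vs", "B", False),
    ("vs", "b", False),
    ("U.S.S.R", "or", False),
]

clf = BoundaryClassifier()
clf.train(TRAINING)
print(clf.is_boundary("right", "Then"))  # long lower-case word, then a capital
print(clf.is_boundary("Dr", "Jones"))    # short capitalised word, then a capital
```

With only eight training examples this mostly learns "short word before the period means abbreviation", so it will still get things like "U.S.A. Then" wrong. That is exactly the training set cost mentioned above.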

Re^3: sentence-safe chop heuristics?
by ww (Archbishop) on Apr 19, 2007 at 04:50 UTC
    Many excellent points, Grundle; I could almost say, "a grundle of excellent examples of cases where my preceding post fails horribly."

    But -- perhaps my point was not made sufficiently blatant: the OP's requirements are unlikely to be met by any "lightweight" approach or simple algorithm. Either will tend to produce simple-minded output.

    As Not_a_Number noted high up in this thread, Lingua::EN::Sentence may be a better choice (your added note regarding training is likely to be helpful to the OP), but unless I've missed something there (certainly possible, as I've only scanned it quickly), dealing with HTML entities is going to take a lot of extending.

      Yes, you are absolutely correct! When dealing with HTML entities this process should be done in two steps.

      Step 1: Data extraction - Use an HTML parser to pull out all of the data first, so that it can be represented in a human-readable format.

      Step 2: Sentence extraction - Use your sentence parser to break the human-readable text up into separate sentences.
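
The two steps can be sketched roughly as follows, again in Python just to keep it self-contained. html.parser is the standard library parser; TextExtractor, html_to_text and split_sentences are names invented for this example, and split_sentences is a deliberately naive stand-in for whatever sentence parser you actually choose.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Step 1: collect only the text nodes, with entities already decoded."""

    def __init__(self):
        # convert_charrefs=True turns &amp;, &eacute; etc. into characters
        # before handle_data() sees them.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Note: this also collects the contents of <script> and <style>;
        # a real extractor would want to skip those.
        self.chunks.append(data)

def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Collapse runs of whitespace left behind by the removed tags.
    return re.sub(r"\s+", " ", "".join(parser.chunks)).strip()

def split_sentences(text):
    """Step 2 (placeholder): split after . ! or ? followed by whitespace.

    This has all the abbreviation problems discussed in this thread;
    swap in a proper sentence parser here.
    """
    return re.split(r"(?<=[.!?])\s+", text)

page = "<p>Fish &amp; chips.</p> <p>Caf&eacute; open?</p>"
text = html_to_text(page)
print(text)
print(split_sentences(text))
```

The point of keeping the steps separate is that the sentence splitter never has to know about tags or entities; it only ever sees plain text.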

      The problem gets even worse when you also have to consider different tagging formats such as XML and its many variants, or an SGML standard, etc., ad nauseam.

      Here is another thought I had recently. Would it be possible to write a grammar and use Parse::RecDescent to pull out sentences? I really haven't investigated it thoroughly yet, but I thought it might be an interesting exercise.