Let's start with some caveats:

First, ++ to thundergnat's warning re the complexities and to the suggestions from Fletch and Not_a_Number.

Second, what's below doesn't address any UTF-8 issues, as those are well covered elsewhere. It also assumes (perhaps mistakenly, since you do mention "...just cutting at a certain number of bytes.") that your use of "chop" in the title and body of the OP means "separate" rather than Perl's chop. And it applies a "non-grammarian" definition of a sentence: no verb is required; a sentence is any run of letters, digits, and punctuation that ends with a period followed by one or more whitespace characters.

If that's ok,

#!C:/perl/bin -w
# sentence.pl
use strict;
use vars qw( $sentence @sentence );

while ( <DATA> ) {
    print "\n<$_>\n\n";
    # split on 1 to 4 periods followed by whitespace
    @sentence = split /\.{1,4}\s+/, $_;
    # push @sentences, $first;
    for $sentence ( @sentence ) {
        # split consumes the period(s), so put one back for display
        print "\t-$sentence.-\n";
    }
}
__DATA__
This is a sentence. This is another.  There were two spaces before this sentence.
This is the first and only sentence of a new paragraph... but it's a long one.
This sentence has commas, thusly, and, for good measure -- dashes -- thusly.
End of test data.

produces output which looks like this:

-This is a sentence.-
-This is another.-
-There were two spaces before this sentence.-
-This is the first and only sentence of a new paragraph.-
-but it's a long one.-
-This sentence has commas, thusly, and, for good measure -- dashes -- thusly.-
-End of test data.-

So, except for the ellipsis (ooops! it gets treated as a sentence end), and absent any HTML entities, this seems to me to answer your requirement.

But once you add the possibility of html entities or tags (did you intend to include tags, possibly with arguments?), the separation becomes far more complicated.

For a very simple example, you might be dealing with text like this:

This is a sentence. &nbsp;And this is another.
While this might appear at the end of a paragraph.</p>
<p>And so on.</p>

And now the output does not satisfy your needs:

-This is a sentence.-
-This is another.-
-There were two spaces before this sentence.-
-This is the first and only sentence of a new paragraph.-
-but it's a long one.-
-This sentence has commas, thusly, and, for good measure -- dashes -- thusly.-
-End of simple test data.-
-Begin simple data with html entities.<br>
.-
-This is a sentence separated from the following sentence by two spaces, one of them an html entity.-
-&nbsp;And this is another.-
-While this might appear at the end of a paragraph.&lt;/p&gt;
.-
-&lt;p&gt;And so on.&lt;/p&gt;
.-

This could, of course, be fixed by using an appropriate module to convert the entities and remove the tags; perhaps HTML::TokeParser::Simple. You could also solve the ellipsis problem with a slightly more complex regex (hint: look for one to three periods, \.{1,3}, followed by \s+, followed by a single upper-case letter) or, better yet, look at the regex in the post (above) by thundergnat.
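Here's a minimal sketch of that hint, not a finished solution. The sub name split_sentences is made up for illustration. I've used a lookbehind so the periods stay attached to their sentence instead of being eaten by split; since any run of one to three periods ends in a period, a single \. in the lookbehind covers it.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the code above): split only where a period
# is followed by whitespace and then an upper-case ASCII letter.
sub split_sentences {
    my ($text) = @_;
    $text =~ s/\s+\z//;    # drop the trailing newline / whitespace
    # The lookbehind keeps the period(s) with the sentence; the lookahead
    # refuses to split before a lower-case continuation such as "... but".
    return split /(?<=\.)\s+(?=[A-Z])/, $text;
}

my $line = "This is the first sentence... but it's a long one.  This is another.\n";
print "\t-$_-\n" for split_sentences($line);
```

The ellipsis now stays inside the first sentence. It will still split wrongly after abbreviations such as "Dr. Smith" and will miss sentences that start with a digit or a quote, so treat it as a starting point only.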

At that point, however, you'll have to decide whether such code would satisfy your requirement for a "light-weight implementation."
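For comparison, the no-module end of that trade-off might look something like the sketch below. It is NOT HTML::TokeParser::Simple and not a real parser: it is a crude regex stand-in, the sub name plain_text is made up, and the entity table assumes only a handful of entities ever show up. Tags with ">" inside attribute values, comments, and numeric entities will all trip it up.

```perl
use strict;
use warnings;

# Assumption: only these few named entities occur in the input.
my %entity = ( nbsp => ' ', amp => '&', lt => '<', gt => '>', quot => '"' );

# Hypothetical helper: crude tag stripping and entity decoding.
sub plain_text {
    my ($html) = @_;
    $html =~ s/<[^>]*>/ /g;    # replace each tag with a space
    # decode known entities, leave unknown ones untouched
    $html =~ s/&(\w+);/exists $entity{$1} ? $entity{$1} : "&$1;"/ge;
    $html =~ s/\s+/ /g;        # collapse runs of whitespace
    $html =~ s/^\s+|\s+$//g;   # trim both ends
    return $html;
}

my $html = "This is a sentence. &nbsp;And this is another.</p>\n<p>And so on.</p>\n";
my $text = plain_text($html);

# Split after a period followed by whitespace, keeping the period.
print "\t-$_-\n" for split /(?<=\.)\s+/, $text;
```

Tags are stripped before entities are decoded so that an encoded &lt;p&gt; survives as literal text rather than being mistaken for markup. If the input is anything other than simple, well-behaved HTML, the module route is the safer bet.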


In reply to Re: sentence-safe chop heuristics? by ww
in thread sentence-safe chop heuristics? by foomatic99
