I would start by using an HTML parser module to extract the first paragraph, and determine that paragraph's length with a simple call to length. If that paragraph doesn't reach n characters, grab the next one, and so on, until the combined length is at least n characters. Keep each of those paragraphs in an array. Then add up the lengths of all but the last paragraph in the array.

Next, subtract that sum from your original n. The result is the maximum number of characters you're willing to pull from the final paragraph in your array.
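The bookkeeping described above might look something like the following. This is only a sketch: it assumes the paragraphs have already been extracted as plain-text strings by whatever HTML parser you choose, and the name collect_paragraphs is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: keep whole paragraphs until their combined length
# reaches $n, then work out how many characters are left for the last one.
# Assumes @paragraphs already holds plain text (HTML parsing is not shown).
sub collect_paragraphs {
    my ( $n, @paragraphs ) = @_;
    my @kept;
    my $total = 0;
    for my $para (@paragraphs) {
        push @kept, $para;
        $total += length $para;
        last if $total >= $n;    # enough text collected
    }
    # Sum of the lengths of all but the last kept paragraph.
    my $used = @kept ? $total - length $kept[-1] : 0;
    # Budget: the most we are willing to take from the last paragraph.
    my $budget = $n - $used;
    return ( $budget, \@kept );
}

# Three dummy paragraphs of 40, 50 and 60 characters, with n = 100.
my ( $budget, $kept ) = collect_paragraphs( 100, 'a' x 40, 'b' x 50, 'c' x 60 );
print "Kept ", scalar @$kept, " paragraphs; $budget characters left for the last one.\n";
```

With these lengths the first two paragraphs use 90 characters, so 10 remain for the third. If the paragraphs never add up to n, the budget simply ends up at least as large as the last paragraph.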

Now you're ready to pass the final paragraph through an adaptation of the following snippet:

```perl
use strict;
use warnings;

my $hardlimit = 100;    # maximum number of characters to keep

# A single long line of test text.
my $string = <<HERE;
This is a test string.  I'm going to force it to split on a word if it kills me.  Ok, supercalifragilisticexpialadocious.
HERE

# Greedily match up to $hardlimit characters, backing off to a word boundary.
if ( $string =~ m/^(.{1,$hardlimit})(?=\b)/ ) {
    print "Match: $1...\n";
    print "Length: ", length $1, ".\n";
}
```

Output:

```
Match: This is a test string.  I'm going to force it to split on a word if it kills me.  Ok, ...
Length: 86.
```

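This will ensure that you're not breaking a plain word in half. No guarantees on numbers, hyphenated words, etc.: \b also matches next to an apostrophe, a hyphen, or a decimal point, so the cut can still land in the middle of something like "I'm". For that, you'll have to improve the regexp. But it's a simple example.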
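One possible improvement, offered only as a sketch, is to allow a cut only where the next character is whitespace, which keeps hyphenated words, contractions, and decimal numbers in one piece. The name truncate_at_whitespace is made up for illustration, and the hard-cut fallback for a single over-long word is my own assumption about what you'd want.

```perl
use strict;
use warnings;

# Hypothetical helper: cut $text to at most $limit characters, breaking only
# where the next character is whitespace.
sub truncate_at_whitespace {
    my ( $text, $limit ) = @_;
    return $text if length $text <= $limit;    # nothing to cut

    # Greedy match of up to $limit characters that must be followed by
    # whitespace; /s lets . match embedded newlines too.
    if ( $text =~ m/^(.{1,$limit})(?=\s)/s ) {
        my $cut = $1;
        $cut =~ s/\s+\z//;    # drop trailing whitespace left by the cut
        return $cut;
    }

    # No whitespace within the limit (one very long word): hard cut.
    return substr $text, 0, $limit;
}

print truncate_at_whitespace( "A well-known fact: I'm not splitting.", 12 ), "\n";
# prints "A well-known"
```

The trade-off is that a shorter limit can give back much less text: with a limit of 8 the same string is cut all the way back to "A", because "well-known" no longer fits as a whole.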

I don't see a good way to prevent sentences from being split. You would have to actually parse the English (or whatever) language to determine what constitutes a sentence. But it's not too hard to at least prevent words from being broken in half.

Hope this helps.


Dave


In reply to Re: Find a good starting section of a long text by davido
in thread Find a good starting section of a long text by johnnywang
