in reply to Find a good starting section of a long text

I would start by using an HTML parser module to extract the first paragraph. Determine that paragraph's length with a simple call to length. If that paragraph doesn't reach n characters, grab the next one, and so on, until you have at least n characters. Keep each of those paragraphs in an array. Then add up the lengths of all but the last paragraph in the array.

Next, subtract that sum from your original n. The result is the maximum number of characters you're willing to pull from the final paragraph in your array.
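The bookkeeping above can be sketched roughly as follows. This is only an illustration: it assumes the paragraphs have already been extracted into a plain list of strings, and collect_paragraphs is a made-up name, not something from an HTML parsing module.

```perl
use strict;
use warnings;

# Hypothetical helper: keep whole paragraphs until their combined length
# reaches $n, then work out how many characters the last one may contribute.
sub collect_paragraphs {
    my ( $n, @paragraphs ) = @_;
    my @kept;
    my $total = 0;
    for my $para (@paragraphs) {
        push @kept, $para;
        $total += length $para;
        last if $total >= $n;
    }
    # Sum of the lengths of every kept paragraph except the last.
    my $sum = @kept ? $total - length $kept[-1] : 0;
    # Budget for the final paragraph: original n minus that sum.
    return ( $n - $sum, @kept );
}

my @paragraphs = (
    'First paragraph.',                   # 16 characters
    'Second paragraph, a bit longer.',    # 31 characters
    'Third paragraph, never reached.',
);
my ( $budget, @kept ) = collect_paragraphs( 30, @paragraphs );
print 'Kept ', scalar @kept, " paragraphs; budget for the last one: $budget\n";
# prints: Kept 2 paragraphs; budget for the last one: 14
```

If the whole text is shorter than n, the budget simply ends up larger than the last paragraph, which is harmless for the truncation step.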

Now you're ready to pass the final paragraph through an adaptation of the following snippet:

    use strict;
    use warnings;

    my $hardlimit = 100;

    my $string = <<HERE;
    This is a test string.  I'm going to force it to split on a word if it kills me.  Ok, supercalifragilisticexpialadocious.
    HERE

    if ( $string =~ m/^(.{1,$hardlimit})(?=\b)/ ) {
        print "Match: $1...\n";
        print "Length: ", length $1, ".\n";
    }

    __OUTPUT__
    Match: This is a test string.  I'm going to force it to split on a word if it kills me.  Ok, ...
    Length: 86.

This ensures that you're not breaking a word in half. There are no guarantees for numbers, hyphenated words, and the like, because \b also matches next to hyphens and decimal points. For those, you'll have to improve the regexp. But it's a simple example.
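One possible improvement, offered only as a sketch, is to cut where the next character is whitespace (or the text ends) instead of at \b. That keeps hyphenated words and decimal numbers intact. The name truncate_at_whitespace is made up for this example, and the /s modifier is my addition so that . also matches newlines inside a multi-line paragraph.

```perl
use strict;
use warnings;

# Hypothetical helper: return at most $limit characters of $text, cutting
# only where the next character is whitespace or the text ends.
sub truncate_at_whitespace {
    my ( $text, $limit ) = @_;
    return '' if $limit < 1;    # {1,0} would be an invalid quantifier
    if ( $text =~ m/^(.{1,$limit})(?=\s|\z)/s ) {
        return $1;
    }
    return '';    # the first word alone is longer than $limit
}

my $text = 'The well-known value 3.14159 is approximately pi.';
print truncate_at_whitespace( $text, 12 ), "\n";    # prints: The
print truncate_at_whitespace( $text, 25 ), "\n";    # prints: The well-known value
```

With the \b version the same limits would give "The well-" and "The well-known value 3." respectively, since \b matches after the hyphen and after the decimal point.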

I don't see a good way to prevent sentences from being split. You would have to actually parse the English (or whatever) language to determine what constitutes a sentence. But it's not too hard to at least prevent words from being broken in half.

Hope this helps.


Dave

Re^2: Find a good starting section of a long text
by bageler (Hermit) on Aug 12, 2004 at 19:58 UTC
    One problem I've had with this regexp (in my case, trying to break off the first sentence) is that abbreviations such as M(r|rs)., Corp., Inc., etc. end in a period but are not a good place to split a body of text for something like inserting an advert. I handle it by making sure the word before the period is at least 5 chars long.
    my ($first, $rest) = $body =~ /(.*?\w{5,}\.)(.*)/s;
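Here is a small runnable illustration of that split. The sample text is made up, and the /s modifier is an assumption added so that a body containing newlines is still captured in full.

```perl
use strict;
use warnings;

# Made-up sample text containing short abbreviations.
my $body = 'Mr. Smith joined Acme Corp. in March. He starts on Monday.';

# Split after the first period that follows at least five word characters,
# so "Mr." and "Corp." are skipped and "March." is taken.
my ( $first, $rest ) = $body =~ /(.*?\w{5,}\.)(.*)/s;

print "First: $first\n";    # prints: First: Mr. Smith joined Acme Corp. in March.
print "Rest:$rest\n";       # prints: Rest: He starts on Monday.
```

The same rule also skips real sentence ends that happen to fall on a short word (for example "It is."), so it is a heuristic rather than a sentence parser.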
      For determining sentence breaks, you might take a look at Lingua::EN::Sentence, which tries to be intelligent about abbreviations and such.