in reply to Find a good starting section of a long text
I would start by using an HTML parser module to extract the first paragraph. Determine that paragraph's length with a simple call to length. If that paragraph doesn't reach n characters, grab the next one, and so on, until you have at least n characters in total. Keep each of those paragraphs in an array. Then add up the lengths of all but the last paragraph in the array.
Next, subtract that sum from your original n. That tells you the maximum number of characters you're willing to pull from the final paragraph in your array.
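The bookkeeping described so far could be sketched roughly as follows. This assumes the paragraphs have already been pulled out of the HTML as plain strings; the sample text, `@paragraphs`, `$n`, and `$budget` are made-up names and values for illustration only.

```perl
use strict;
use warnings;

# Hypothetical input: paragraphs already extracted as plain text
# by whatever HTML parser module you prefer.
my @paragraphs = (
    'First paragraph of the text.',
    'Second paragraph, a bit longer than the first one.',
    'Third paragraph that will not be needed.',
);
my $n = 50;    # hypothetical target excerpt length, in characters

# Keep grabbing paragraphs until we have at least $n characters.
my @kept;
my $total = 0;
for my $para (@paragraphs) {
    push @kept, $para;
    $total += length $para;
    last if $total >= $n;
}

# Add up the lengths of all but the last kept paragraph.
my $sum = 0;
$sum += length $_ for @kept[ 0 .. $#kept - 1 ];

# The most we are willing to take from the final kept paragraph.
my $budget = $n - $sum;

print "Kept ", scalar @kept, " paragraph(s); budget for the last one: $budget\n";
```

If the whole text is shorter than n, the loop simply keeps every paragraph, and the budget ends up larger than the last paragraph, so nothing needs trimming.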
Now you're ready to pass the final paragraph through an adaptation of the following snippet:

    use strict;
    use warnings;

    my $hardlimit = 100;
    my $string = <<HERE;
    This is a test string.  I'm going to force it to split on a word if it kills me.  Ok, supercalifragilisticexpialadocious.
    HERE

    # Grab up to $hardlimit characters, then back off to the nearest word boundary.
    if ( $string =~ m/^(.{1,$hardlimit})(?=\b)/ ) {
        print "Match: $1...\n";
        print "Length: ", length $1, ".\n";
    }

    __END__
    Match: This is a test string.  I'm going to force it to split on a word if it kills me.  Ok, ...
    Length: 86.

This will ensure that you're not breaking a word in half. No guarantees on numbers, hyphenated words, etc., since \b also matches inside things like "1,234" or "state-of-the-art". For that, you'll have to improve the regexp. But it's a simple example.
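One possible way to improve the regexp, offered only as a sketch: instead of stopping at any word boundary, stop only where whitespace or the end of the text follows. That keeps hyphenated words and numbers such as 1,234.56 in one piece, at the cost of leaving punctuation attached to the preceding word. The helper name `cut_at_whitespace` is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: take at most $limit characters from $text,
# cutting only where whitespace (or the end of the text) follows.
sub cut_at_whitespace {
    my ( $text, $limit ) = @_;
    return $1 if $text =~ m/^(.{1,$limit})(?=\s|\z)/s;
    return '';    # the very first token is already longer than $limit
}

my $text    = 'A state-of-the-art test costs 1,234.56 dollars.';
my $excerpt = cut_at_whitespace( $text, 20 );
print "Match: $excerpt...\n";
print "Length: ", length $excerpt, ".\n";
```

Note that the helper returns an empty string when even the first token doesn't fit, so the caller has to decide whether to fall back to a hard cut in that case.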
I don't see a good way to prevent sentences from being split. You would have to actually parse the English (or whatever) language to determine what constitutes a sentence. But it's not too hard to at least prevent words from being broken in half.
Hope this helps.
Dave