Find a good starting section of a long text

johnnywang has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Find a good starting section of a long text by davido (Cardinal) on Aug 12, 2004 at 19:08 UTC
I would start by using an HTML parser module to extract the first paragraph. Determine that paragraph's length with a simple call to length. If that paragraph doesn't reach n characters, grab the next one, and so on, until you've got n characters. Keep each of those paragraphs in an array. Then add up the lengths of all but the last paragraph in the array and subtract that total from your original n. That tells you the maximum number of characters you're willing to pull from the final paragraph in your array. Now you're ready to pass the final paragraph through an adaptation of the following snippet: `use strict; use warnings; my $hardlimit = 100; my $string = <<HERE; This is a test string. I'm going to force it to split on a word if it kills me. Ok, supercalifragilisticexpialadocious. HERE if ( $string =~ m/^(.{1,$hardlimit})(?=\b)/ ) { print "Match: $1...\n"; print "Length: ", length $1, ".\n"; } __OUTPUT__ Match: This is a test string. I'm going to force it to split on a word if it kills me. Ok, ... Length: 86.` This will ensure that you're not breaking a word in half. No guarantees on numbers, hyphenated words, etc. For that, you'll have to improve the regexp. But it's a simple example. I don't see a good way to prevent sentences from being split. You would have to actually parse the English (or whatever) language to determine what constitutes a sentence. But it's not too hard to at least prevent words from being broken in half. Hope this helps. Dave
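The whole procedure described above (collect paragraphs, work out the budget for the last one, trim it at a word boundary) might be sketched roughly as follows. This is only an illustration under assumptions: `excerpt` is a made-up name, the paragraphs are assumed to be plain-text strings already pulled out of the HTML, and the blank-line separators and trailing "..." are not counted against the limit.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the post): keep whole paragraphs until
# $limit characters are reached, then trim the last one to a word boundary.
# Assumes @paragraphs are already plain text; HTML parsing is left out.
sub excerpt {
    my ($limit, @paragraphs) = @_;
    my @kept;
    my $total = 0;
    for my $para (@paragraphs) {
        push @kept, $para;
        $total += length $para;
        last if $total >= $limit;
    }
    return '' unless @kept;

    # Nothing to trim if everything collected fits within the limit.
    return join("\n\n", @kept) if $total <= $limit;

    # Budget for the final paragraph: the limit minus everything before it.
    my $last   = pop @kept;
    my $budget = $limit - ($total - length $last);

    # Greedy match of up to $budget characters, backing off to a word boundary.
    if ($last =~ /^(.{1,$budget})\b/s) {
        (my $piece = $1) =~ s/\s+$//;   # drop trailing whitespace before the dots
        push @kept, "$piece...";
    }
    return join "\n\n", @kept;
}

my @paras = (
    "First paragraph here.",
    "Second paragraph is a bit longer than the first one.",
);
print excerpt(39, @paras), "\n";
```

If the budget falls inside a single word that is longer than the budget itself, the pattern finds no boundary and the last paragraph is simply left out, which may or may not be what you want.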
Re^2: Find a good starting section of a long text by bageler (Hermit) on Aug 12, 2004 at 19:58 UTC
One problem I've had with this regexp (in my case, trying to break off the first sentence) is that abbreviations such as M(r|rs)., Corp., Inc., etc. turn up in the text but are not a good place to split it for something like inserting an advert. I handle it by making sure the word before the period is at least 5 chars long. `($first, $rest) = $body =~ /(.*?\w{5,}\.)(.*)/;`
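To see the five-character heuristic in action, here is a small sketch. It assumes the pattern is meant to capture everything up to the first period that follows a word of five or more characters, plus the remainder; `split_first_sentence`, the `/s` flag and the `\s*` between the two captures are my additions for illustration, not part of the post.

```perl
use strict;
use warnings;

# Hypothetical helper: split $body after the first period that follows a
# word of at least five characters, so "Mr." or "Corp." is passed over.
sub split_first_sentence {
    my ($body) = @_;
    # /s lets . match newlines, in case the body spans several lines.
    if ($body =~ /\A(.*?\w{5,}\.)\s*(.*)\z/s) {
        return ($1, $2);
    }
    return ($body, '');   # no qualifying period found
}

my ($first, $rest) = split_first_sentence(
    "Mr. Smith joined Acme Corp. in August. He left a year later."
);
print "First: $first\n";
print "Rest:  $rest\n";
```

The obvious catch is that a real sentence ending in a short word ("He saw it.") is skipped as well, so the split can land later than expected.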
Re^3: Find a good starting section of a long text by Anonymous Monk on Aug 13, 2004 at 01:35 UTC
For determining sentence breaks, you might take a look at Lingua::EN::Sentence, which tries to be intelligent about abbreviations and such.
Re: Find a good starting section of a long text by BrowserUk (Patriarch) on Aug 12, 2004 at 20:14 UTC
Maybe something like this could be adjusted to your needs? {janitored by ybiC: Balanced <readmore> tags around longish example codeblock}
Re: Find a good starting section of a long text by NiJo (Friar) on Aug 12, 2004 at 19:35 UTC
I'd develop weighted tests for each condition. Count chars for paragraphs ending shorter ($par_short) and longer ($par_long) than $goal. Do the same for sentences ($sen_short and $sen_long) and words ($word_short, $word_long). Then use the candidate with the minimum of `( ($goal - $test) * 1/$weight ) ** 2`. Tuning the weight of the tests requires looking at sample data, but (30, 15, 5) for par, sen and word should be a start. The given example weights a paragraph deviating by 30 chars from $goal the same as a word deviating by 5 chars.
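A rough sketch of that scoring, under assumptions: rather than six named variables it takes a list of candidate cut points, each a hash with a `length` (chars from the start of the text) and a `kind` (par, sen or word). The names `score` and `best_cut` and the hash layout are invented for illustration; the weights are the suggested starting values.

```perl
use strict;
use warnings;

# Starting weights from the post: a larger weight tolerates a larger deviation.
my %weight = (par => 30, sen => 15, word => 5);

# Squared, weight-scaled deviation of one candidate from the goal length.
sub score {
    my ($goal, $cand) = @_;
    return ( ($goal - $cand->{length}) / $weight{ $cand->{kind} } ) ** 2;
}

# Return the candidate with the smallest score.
sub best_cut {
    my ($goal, @candidates) = @_;
    my ($best) = sort { score($goal, $a) <=> score($goal, $b) } @candidates;
    return $best;
}

my $cut = best_cut(
    200,
    { kind => 'par',  length => 185 },   # 15 off: (15/30)**2 = 0.25
    { kind => 'sen',  length => 190 },   # 10 off: (10/15)**2, about 0.44
    { kind => 'word', length => 197 },   #  3 off: (3/5)**2   = 0.36
);
print "Cut at $cut->{length} ($cut->{kind} boundary)\n";
```

Here the paragraph break wins even though it is the furthest from the goal, which is the point of the weighting.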
Re: Find a good starting section of a long text by Tuppence (Pilgrim) on Aug 13, 2004 at 02:09 UTC
Since you seem to already have a good idea of what you want to have happen, I would suggest writing some tests that demonstrate that your function behaves as you wish it to. Oftentimes I find that, by the time I have the tests written enough to be useful, the code almost writes itself, and it makes the really hard things possible by keeping a hard copy around of what getting each little piece working entails. If you have a representative sample of the data you are trying to reformat, it might help to pull out some 'hard' cases from it, or indeed start with a simplistic solution, and use that until you can find some data that it breaks with.
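As a starting point, a test file along these lines could pin down the behaviour before the real code exists. Everything here is hypothetical: `shorten` is a deliberately simplistic stand-in for your function, and the cases are made up rather than taken from real data.

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical, deliberately simplistic first attempt: return the longest
# prefix of at most $limit characters that ends on a word boundary.
sub shorten {
    my ($text, $limit) = @_;
    return $text if length($text) <= $limit;
    return $text =~ /^(.{1,$limit})\b/s ? $1 : '';
}

is( shorten('short text', 100), 'short text',
    'text under the limit is untouched' );
is( shorten('alpha beta gamma', 10), 'alpha beta',
    'cut lands on a word boundary' );
cmp_ok( length( shorten('alpha beta gamma', 12) ), '<=', 12,
    'result never exceeds the limit' );
is( shorten('supercalifragilistic', 5), '',
    'hard case: a single word longer than the limit' );

done_testing();
```

Each 'hard' case you find in the sample data becomes one more line in the file, and the simplistic version gets replaced once a case shows where it breaks.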