Re: Separating HTML by paragraph, sentence

Your idea to establish a heuristic for what makes a paragraph in badly formatted HTML is interesting. Off the top of my head, I can think of several ways a paragraph could be denoted:

text after an opening paragraph tag, of course, up to a possible closing tag
text after an ending paragraph tag that's not marked as anything special itself
two or more successive line break tags
a div
a list, either ordered or unordered
a table row, or sometimes an individual cell

Of course, what makes this a special kind of torture for you is that it's not easy to tell whether a table row should be kept together, and that most of these elements can be nested inside one another. You almost need a parse tree of the document plus a heuristic in order to preserve the author's intended paragraphs.

Your life gets easier if you can treat each table as a single object.
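
To make the list of cues above concrete, here is a minimal sketch of the "parser plus heuristic" idea. It is written in Python with the standard-library html.parser purely for illustration; it is not the HTML::TreeBuilder approach, and the names ParagraphSplitter, BREAKING_TAGS and split_paragraphs are made up for this example. The assumptions are that p, div, li, ul, ol and tr all end a paragraph, that two or more br tags in a row do too, and that everything inside a table is kept together as one chunk.

```python
from html.parser import HTMLParser

# Assumption: the start or end of any of these tags closes the current paragraph.
BREAKING_TAGS = {"p", "div", "li", "ul", "ol", "tr"}


class ParagraphSplitter(HTMLParser):
    """Collect paragraph-like chunks of text from sloppy HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []   # finished chunks, in document order
        self.buffer = []       # text pieces of the chunk being built
        self.table_depth = 0   # > 0 while inside a (possibly nested) table
        self.pending_br = 0    # consecutive <br> tags seen so far

    def flush(self):
        # Join the buffered text, collapse whitespace, keep non-empty chunks.
        text = " ".join("".join(self.buffer).split())
        if text:
            self.paragraphs.append(text)
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag != "br":
            self.pending_br = 0
        if tag == "table":
            if self.table_depth == 0:
                self.flush()              # close whatever preceded the table
            self.table_depth += 1
        elif self.table_depth:
            self.buffer.append(" ")       # inside a table: only separate cells
        elif tag == "br":
            self.pending_br += 1
            if self.pending_br >= 2:
                self.flush()              # two or more breaks end a paragraph
            else:
                self.buffer.append(" ")   # a single break is just whitespace
        elif tag in BREAKING_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth:
            self.table_depth -= 1
            if self.table_depth == 0:
                self.flush()              # the whole table becomes one chunk
        elif not self.table_depth and tag in BREAKING_TAGS:
            self.flush()

    def handle_data(self, data):
        if data.strip():
            self.pending_br = 0           # real text interrupts a run of breaks
        self.buffer.append(data)


def split_paragraphs(markup):
    parser = ParagraphSplitter()
    parser.feed(markup)
    parser.close()
    parser.flush()                        # pick up trailing unmarked text
    return parser.paragraphs
```

Calling split_paragraphs on a string of markup returns a list of plain-text chunks. Whether a table row or an individual cell deserves its own chunk is exactly the judgment call described above; this sketch sidesteps it by keeping each outermost table whole.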

Re^2: Separating HTML by paragraph, sentence by downer (Monk) on Oct 03, 2007 at 17:13 UTC
It seems that maybe this is more of an algorithms problem than a Perl problem.
Re^3: Separating HTML by paragraph, sentence by mr_mischief (Monsignor) on Oct 03, 2007 at 17:30 UTC
Well, if you make it a module and put it on CPAN, we can make it Perl wisdom to point said module out to others who ask. Otherwise, just realizing that your approach has as much to do with the plan as with the tools is wisdom in itself. BTW, it's probably possible to write an ad hoc text extractor using heuristic rules and regular expressions to get close to what you want without building a proper tree. Without trying, I'm not sure whether HTML::TreeBuilder or the like would really be necessary, but my gut feeling is that it could help quite a bit.
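
For what the ad hoc route might look like, here is a rough sketch using nothing but regular expressions. It is in Python only for illustration (the patterns should carry over to Perl nearly unchanged), and the names BOUNDARY, ANY_TAG and rough_paragraphs are invented for this example. It assumes that p, div, li, ul, ol and tr tags and runs of two or more br tags are the only paragraph boundaries, and it makes no attempt to keep a table together or to cope with nesting, comments or scripts.

```python
import html
import re

# Assumed paragraph boundaries: opening or closing block-ish tags,
# or two or more <br> tags in a row (optionally separated by whitespace).
BOUNDARY = re.compile(
    r"</?(?:p|div|li|ul|ol|tr)\b[^>]*>|(?:<br\b[^>]*>\s*){2,}",
    re.IGNORECASE,
)
# Anything else that looks like a tag is treated as inline markup.
ANY_TAG = re.compile(r"<[^>]+>")


def rough_paragraphs(markup):
    paragraphs = []
    for chunk in BOUNDARY.split(markup):
        text = ANY_TAG.sub(" ", chunk)   # drop remaining inline tags
        text = html.unescape(text)       # decode &amp; and similar entities
        text = " ".join(text.split())    # collapse runs of whitespace
        if text:
            paragraphs.append(text)
    return paragraphs
```

This gets close on simple pages, but it has no notion of what contains what, which is where a real tree is likely to earn its keep.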