
It seems this is more of an algorithms problem than a Perl problem.

Re^3: Seperating HTML by paragraph, sentence
by mr_mischief (Monsignor) on Oct 03, 2007 at 17:30 UTC
    Well, if you make it a module and put it on CPAN, we can make it Perl wisdom to point said module out to others who ask. Otherwise, just realizing that your approach has as much to do with the plan as with the tools is wisdom in itself.

    BTW, it's probably possible to write an ad hoc text extractor that uses heuristic rules and regular expressions to get close to what you want without building a proper tree. Without trying it, I'm not sure whether HTML::TreeBuilder or something like it is really necessary, but my gut feeling is that it would help quite a bit.
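
    For what it's worth, here is a rough sketch of what such an ad hoc extractor might look like, using only core Perl. The sub name html_to_sentences, the list of tags treated as paragraph breaks, and the sentence-boundary rule are all my own assumptions for illustration, not anything from the original question. The heuristics are deliberately naive, and abbreviations like "Dr. Smith" will be split in the wrong place.

```perl
use strict;
use warnings;

# Hypothetical helper (name and rules are assumptions): returns a reference
# to a list of paragraphs, each of which is a reference to a list of sentences.
sub html_to_sentences {
    my ($html) = @_;

    # Drop comments and script/style blocks; their contents are not prose.
    $html =~ s/<!--.*?-->//gs;
    $html =~ s/<(script|style)\b.*?<\/\1\s*>//gis;

    # Heuristic: treat these block-level tags (open or close) as paragraph breaks.
    $html =~ s/<\/?(?:p|div|br|h[1-6]|li|tr|blockquote)\b[^>]*>/\n\n/gi;

    # Strip whatever tags remain (inline markup such as <b> or <a>).
    $html =~ s/<[^>]+>//g;

    # Decode a handful of common entities; this is nowhere near a full decoder.
    my %entity = (amp => '&', lt => '<', gt => '>', quot => '"', nbsp => ' ');
    $html =~ s/&(amp|lt|gt|quot|nbsp);/$entity{$1}/g;

    my @paragraphs;
    for my $para (split /\n\s*\n/, $html) {
        $para =~ s/\s+/ /g;          # collapse runs of whitespace
        $para =~ s/^\s+|\s+$//g;     # trim both ends
        next unless length $para;

        # Naive sentence rule: break after . ! or ? when followed by
        # whitespace and then a capital letter or a quote character.
        my @sentences = split /(?<=[.!?])\s+(?=[A-Z"'])/, $para;
        push @paragraphs, \@sentences;
    }
    return \@paragraphs;
}

# Usage: print each sentence, indented under its paragraph number.
my $doc = '<div><p>Hello world. This is <b>bold</b>!</p><p>Second para? Yes.</p></div>';
my $paragraphs = html_to_sentences($doc);
for my $i (0 .. $#$paragraphs) {
    print "Paragraph ", $i + 1, ":\n";
    print "  $_\n" for @{ $paragraphs->[$i] };
}
```

    Anything with nested or malformed markup will probably trip this up, which is where a real parser such as HTML::TreeBuilder would likely earn its keep.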