I realize that I could probably use the HTML tags to aid this task, but I have been unsuccessful in doing so. I really appreciate any help!
With that realization in mind, take a look at HTML::TokeParser instead of HTML::Strip.
Peter L. Berghold -- Unix Professional
Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg
It seems that maybe this is more of an algorithms problem than a Perl problem.
Well, if you make it a module and put it on CPAN, we can make it Perl wisdom to point said module out to others who ask. Otherwise, just realizing that your approach has as much to do with the plan as with the tools is wisdom in itself.
BTW, it's probably possible to write an ad hoc text extractor using heuristic rules and regular expressions to get close to what you want without building a proper tree. I'm not sure without trying if HTML::TreeBuilder or such would be really necessary, but my gut feeling is that it could help quite a bit.
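To make the ad hoc idea concrete, here is a minimal sketch using nothing but regular expressions. The sub name `rough_paragraphs`, the list of tags treated as paragraph boundaries, and the five-word minimum are all assumptions of mine for illustration, not something tested against real-world pages:

```perl
use strict;
use warnings;

# Hypothetical helper: split HTML into rough "paragraphs" with regexes only.
# The tag list and the word threshold are guesses; tune them per site.
sub rough_paragraphs {
    my ($html, $min_words) = @_;
    $min_words //= 5;    # assumed cutoff for dropping nav links and other short scraps

    # Drop comments and the contents of script/style elements.
    $html =~ s/<!--.*?-->//gs;
    $html =~ s/<(script|style)\b[^>]*>.*?<\/\1\s*>//gis;

    # Treat a few common block-ish tags as paragraph boundaries.
    $html =~ s/<\/?(?:p|div|br|h[1-6]|li|tr|table|blockquote)\b[^>]*>/\n\n/gi;

    # Strip whatever tags remain (assumed to be inline).
    $html =~ s/<[^>]+>//g;

    my @paras;
    for my $chunk (split /\n\s*\n/, $html) {
        $chunk =~ s/\s+/ /g;          # collapse runs of whitespace
        $chunk =~ s/^\s+|\s+$//g;     # trim both ends
        my @words = split ' ', $chunk;
        push @paras, $chunk if @words >= $min_words;
    }
    return @paras;
}
```

Calling `rough_paragraphs($html)` returns a list of text chunks. It does not decode entities, and it will be fooled by a `>` inside an attribute value, which is the sort of thing a real parser handles for you.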
One obvious approach to parsing the HTML for what you're calling "paragraphs" is based on the two main kinds of HTML tag.
There are block tags, which create a line break above and below, and inline tags, which flow with the text.
The DTD for HTML will tell you which is which, if you feel like working through it: http://www.w3.org/TR/html401/sgml/dtd.html
Complications of course arise because there can be blocks inside blocks, and because CSS is allowed to re-define the block/inline setting to suit the author.
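A rough sketch of the block/inline idea, using HTML::TokeParser as suggested earlier in the thread. The sub name `block_paragraphs` is made up for illustration. The first group of tags follows my reading of the block-level elements in the HTML 4.01 DTD; the second group (li, td, br and friends) is my own addition because those also start a new line in practice. It ignores CSS entirely and treats nested blocks naively, so take it as a starting point only:

```perl
use strict;
use warnings;
use HTML::TokeParser;

my %is_block = map { $_ => 1 } qw(
    p h1 h2 h3 h4 h5 h6 ul ol dl pre div noscript blockquote
    form hr table fieldset address
    li dt dd tr td th br body
);

# Hypothetical helper: collect text, starting a new chunk at every block tag.
sub block_paragraphs {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";

    my @paras;
    my $buf = '';
    my $flush = sub {
        $buf =~ s/\s+/ /g;          # collapse runs of whitespace
        $buf =~ s/^\s+|\s+$//g;     # trim both ends
        push @paras, $buf if length $buf;
        $buf = '';
    };

    while (my $tok = $parser->get_token) {
        my $type = $tok->[0];
        if ($type eq 'T') {
            # The third field is set for literal content such as script/style; skip it.
            next if $tok->[2];
            $buf .= $tok->[1];
        }
        elsif ($type eq 'S' || $type eq 'E') {
            # Tag names arrive lowercased; inline tags just let the text flow on.
            $flush->() if $is_block{ $tok->[1] };
        }
    }
    $flush->();
    return @paras;
}
```

Text comes back with entities still encoded; `decode_entities` from HTML::Entities can be applied to each chunk if that matters.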
Nobody says perl looks like line-noise any more
kids today don't know what line-noise IS ...