in reply to Seperating HTML by paragraph, sentence

One obvious approach to parsing the HTML for what you're calling "paragraphs" is based on the two main kinds of HTML tag.

There are block tags, which create a line break above and below, and inline tags, which flow with the text.

The DTD for HTML will tell you which is which, if you feel like working through it (look for the %block; and %inline; entities): http://www.w3.org/TR/html401/sgml/dtd.html

Complications of course arise because there can be blocks inside blocks, and because CSS is allowed to redefine the block/inline setting (through the display property) to suit the author.
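
As a rough illustration of the block/inline idea, here is a minimal sketch in Python (standard library only, chosen for brevity rather than because it is the language of the original question). The names BLOCK_TAGS, ParagraphSplitter and split_paragraphs are made up for this example, and the tag list is a hand-picked subset of the block-level elements, not the full set from the DTD. It deals with nested blocks by starting a new "paragraph" at every block boundary, it ignores CSS entirely, and it does not treat script or style content specially.

```python
from html.parser import HTMLParser

# Assumption: a hand-picked subset of HTML 4.01 block-level elements.
# Check the DTD for the complete list before relying on this.
BLOCK_TAGS = {
    "p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "ul", "ol", "li",
    "dl", "dt", "dd", "pre", "blockquote", "table", "tr", "td", "th",
    "form", "hr", "address", "body",
}


class ParagraphSplitter(HTMLParser):
    """Collect text runs, cutting a new run at every block tag boundary."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._buffer = []

    def _flush(self):
        # Join the buffered text, collapse whitespace, keep non-empty runs.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()              # a block opens: end the current run
        elif tag == "br":
            self._buffer.append(" ")   # inline break: keep words apart

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()              # a block closes: end the current run

    def handle_data(self, data):
        self._buffer.append(data)      # inline tags never flush, so text flows


def split_paragraphs(html):
    parser = ParagraphSplitter()
    parser.feed(html)
    parser.close()     # pushes any text still held by the parser
    parser._flush()    # pick up text that follows the last block tag
    return parser.paragraphs


if __name__ == "__main__":
    sample = "<div><h1>Title</h1><p>First <b>bold</b> para.</p></div>"
    for para in split_paragraphs(sample):
        print(para)
```

Because inline tags such as b never trigger a flush, their text stays in the same run as the surrounding words, while a block inside a block just produces another cut. A page whose stylesheet changes display would need more than this.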



Nobody says perl looks like line-noise any more
kids today don't know what line-noise IS ...