in reply to Extracting paragraphs from html
If you can't rely on certain tags (and I agree that you can't), the question is, what is the definition of a paragraph?
Where does it stop? Certainly not at a newline, for we are dealing with HTML, and there might be many newlines in the source code where they are not visible in the page that is actually displayed.
So what would possibly terminate a paragraph?
And for a pragmatic approach, you might want to specify a maximum length at which the given text is truncated. Some people don't use paragraphs at all; they just type or copy hundreds or thousands of words onto a page, as if they were writing a novel or had never understood the need for formatting.
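To make that concrete, here is a rough sketch of one possible answer, not a definitive one. It assumes a paragraph ends wherever a block-level tag opens or closes, ignores newlines in the source, and truncates each chunk at a maximum length. The tag list, the names `ParagraphExtractor` and `extract_paragraphs`, and the default of 500 characters are all my own assumptions for illustration. It uses Python's standard `html.parser` and makes no attempt to skip script or style content.

```python
from html.parser import HTMLParser

# Assumption: these tags terminate a paragraph. The list is illustrative,
# not exhaustive; adjust it to the pages you actually have to deal with.
BLOCK_TAGS = {
    "p", "div", "br", "hr", "li", "ul", "ol", "table", "tr", "td", "th",
    "blockquote", "pre", "h1", "h2", "h3", "h4", "h5", "h6",
}


class ParagraphExtractor(HTMLParser):
    """Collects text chunks separated by block-level tags."""

    def __init__(self, max_len=500):
        super().__init__()
        self.max_len = max_len      # hypothetical cut-off for runaway text
        self.paragraphs = []
        self._buf = []

    def _flush(self):
        # Collapse all whitespace runs, including source newlines,
        # into single spaces, since they are not visible in the page.
        text = " ".join("".join(self._buf).split())
        self._buf = []
        if text:
            self.paragraphs.append(text[:self.max_len])

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buf.append(data)

    def close(self):
        super().close()
        self._flush()               # emit whatever text is left at the end


def extract_paragraphs(html, max_len=500):
    parser = ParagraphExtractor(max_len)
    parser.feed(html)
    parser.close()
    return parser.paragraphs


if __name__ == "__main__":
    sample = "<div>First\nline<br>Second <b>bold</b> text</div>tail"
    for para in extract_paragraphs(sample, max_len=11):
        print(para)
```

Inline tags such as `<b>` do not split the text, while `<br>` and the closing `</div>` do. Whether a single `<br>` should really count as a paragraph break is exactly the kind of decision the definition has to settle.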