Your idea of establishing a heuristic for what makes a paragraph in badly formatted HTML is interesting. Off the top of my head, I can think of several ways a paragraph could be denoted:
- text after an opening paragraph tag, of course, up to a possible closing tag
- text after an ending paragraph tag that's not marked as anything special itself
- two or more line break tags in succession
- a div
- a list, either ordered or unordered
- a table row, or sometimes an individual cell
Of course, what makes this a special kind of torture for you is that it's not easy to tell if a table row should be kept together and that most of these elements can be contained within one another. You almost need a parse tree of the document plus a heuristic in order to preserve the author's intended paragraphs.
Your life gets easier if you can treat each table as a single object.
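To make that concrete, here is a rough sketch of such a heuristic using Python's standard `html.parser`, which tolerates unclosed tags. The names `ParagraphSplitter`, `split_paragraphs`, and `BREAK_TAGS` are made up for illustration, and the choice of which tags count as boundaries is an assumption you would tune for your documents. It keeps each outermost table together as one unit and does not build a real parse tree, so nested block elements are simply flattened.

```python
from html.parser import HTMLParser

# Assumption: these tags start or end a paragraph. Adjust to taste.
BREAK_TAGS = {"p", "div", "ul", "ol", "li"}


class ParagraphSplitter(HTMLParser):
    """Collects paragraph-like chunks of text from sloppy HTML."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buf = []          # text pieces of the paragraph being built
        self._br_run = 0        # consecutive <br> tags seen so far
        self._table_depth = 0   # > 0 while inside a table

    def _flush(self):
        # Collapse whitespace and store the chunk if it is not empty.
        text = " ".join("".join(self._buf).split())
        if text:
            self.paragraphs.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag != "br":
            self._br_run = 0
        if tag == "table":
            if self._table_depth == 0:
                self._flush()
            self._table_depth += 1
        elif self._table_depth:
            # Inside a table: keep everything together, just separate cells.
            self._buf.append(" ")
        elif tag == "br":
            self._br_run += 1
            if self._br_run >= 2:
                self._flush()        # two or more breaks end the paragraph
            else:
                self._buf.append(" ")  # a single break is only a space
        elif tag in BREAK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag == "table":
            if self._table_depth:    # ignore stray closing tags
                self._table_depth -= 1
                if self._table_depth == 0:
                    self._flush()
        elif not self._table_depth and tag in BREAK_TAGS:
            self._flush()

    def handle_data(self, data):
        if data.strip():
            self._br_run = 0         # real text interrupts a run of breaks
        self._buf.append(data)


def split_paragraphs(html):
    parser = ParagraphSplitter()
    parser.feed(html)
    parser.close()
    parser._flush()                  # pick up any trailing unmarked text
    return parser.paragraphs
```

Calling `split_paragraphs(some_html)` returns a list of strings, one per guessed paragraph. Text that follows a closing paragraph tag without any markup of its own ends up as a separate entry, because the closing tag flushes the buffer and the loose text accumulates until the next boundary.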