Your idea to establish a heuristic for what makes a paragraph in badly formatted HTML is interesting. Off the top of my head, I can think of several ways a paragraph could be denoted.
Of course, what makes this a special kind of torture for you is that it's not easy to tell if a table row should be kept together and that most of these elements can be contained within one another. You almost need a parse tree of the document plus a heuristic in order to preserve the author's intended paragraphs.
Your life gets easier if you can treat each table as a single object rather than deciding row by row what belongs together.
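
To make that concrete, here is a minimal sketch in Python using only the standard library's `html.parser`. It is not a full parse tree: it streams through the tags and keeps a nesting counter, which is enough to treat an outermost table (including any tables nested inside it) as one unit. The `BLOCK_TAGS` set, the `ParagraphSplitter` class, and the `split_paragraphs` helper are my own hypothetical names and choices, so tune them to the markup you actually see.

```python
from html.parser import HTMLParser

# Assumption: these tags mark a paragraph boundary. Adjust for your documents.
BLOCK_TAGS = {"p", "div", "br", "li", "blockquote", "pre",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class ParagraphSplitter(HTMLParser):
    """Collects paragraph-like text chunks; each outermost table is one chunk."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._buffer = []
        self._table_depth = 0  # > 0 while inside any <table>

    def _flush(self):
        # Collapse runs of whitespace and emit the chunk if it is non-empty.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self._table_depth == 0:
                self._flush()  # text before the table is its own chunk
            self._table_depth += 1
        elif self._table_depth > 0:
            self._buffer.append(" ")  # keep cell contents apart
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag == "table":
            if self._table_depth > 0:
                self._table_depth -= 1
                if self._table_depth == 0:
                    self._flush()  # the whole table becomes one chunk
        elif self._table_depth > 0:
            self._buffer.append(" ")
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)


def split_paragraphs(html):
    parser = ParagraphSplitter()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text, e.g. after an unclosed tag
    return parser.paragraphs


if __name__ == "__main__":
    sample = ("<p>First<br>Second<table><tr><td>a</td><td>b</td></tr>"
              "<tr><td>c</td></tr></table>Third")
    print(split_paragraphs(sample))
```

Because the splitter never requires closing tags, it tolerates the unclosed `<p>` in the sample. If you need finer control, such as keeping a list together or treating a `div` differently depending on its parent, that is the point where a real tree from a lenient HTML parser starts to pay off.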