in reply to Preserving layout in pdf to text or html to text conversion

Maybe this is obvious, but if you're going to add some of this functionality yourself, consider subclassing or otherwise building on one of the existing parser modules you mentioned. If the existing module can do most of the work, you can focus on processing the DIV tags and the information in them. Just make sure the parser module you pick keeps the CSS info rather than immediately throwing it out.
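To make the idea concrete, here is a rough sketch of the subclassing approach. It uses Python's standard html.parser as a stand-in for whichever parser module you end up choosing, and it assumes the converter emits DIVs with inline style attributes carrying absolute top/left positions in pixels; your HTML may well put the CSS elsewhere. The names PositionedDivParser, parse_style, px, render and the char_width value are all made up for illustration.

```python
import re
from html.parser import HTMLParser


def parse_style(style):
    """Split an inline style string into a dict of CSS property -> value."""
    props = {}
    for decl in style.split(";"):
        if ":" in decl:
            name, value = decl.split(":", 1)
            props[name.strip().lower()] = value.strip()
    return props


def px(value):
    """Read the leading number from a CSS length such as '120px' (0.0 if none)."""
    m = re.match(r"\s*(-?\d+(?:\.\d+)?)", value)
    return float(m.group(1)) if m else 0.0


class PositionedDivParser(HTMLParser):
    """Let the base parser tokenize; only collect DIVs that carry top/left."""

    def __init__(self):
        super().__init__()
        self.stack = []   # one record per currently open div
        self.blocks = []  # (top, left, text) for each positioned div

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            # A valueless style attribute comes through as None, hence the "or".
            style = parse_style(dict(attrs).get("style") or "")
            self.stack.append({"style": style, "text": []})

    def handle_data(self, data):
        if self.stack:
            self.stack[-1]["text"].append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            rec = self.stack.pop()
            text = " ".join("".join(rec["text"]).split())
            style = rec["style"]
            if text and "top" in style and "left" in style:
                self.blocks.append((px(style["top"]), px(style["left"]), text))


def render(blocks, char_width=8.0):
    """Place each block on a text line per 'top', at a column derived from 'left'.

    char_width (pixels per character) is an assumed value; tune it per document.
    """
    lines = {}
    for top, left, text in sorted(blocks):
        line = lines.get(top, "")
        col = int(left / char_width)
        if line:
            # Never overwrite earlier text; keep at least one space between blocks.
            col = max(col, len(line) + 1)
        lines[top] = line.ljust(col) + text
    return "\n".join(lines[top] for top in sorted(lines))


sample = (
    '<div style="position:absolute; top:10px; left:0px">Name</div>'
    '<div style="position:absolute; top:10px; left:80px">Qty</div>'
    '<div style="position:absolute; top:30px; left:0px">Apple</div>'
    '<div style="position:absolute; top:30px; left:80px">3</div>'
)
parser = PositionedDivParser()
parser.feed(sample)
text = render(parser.blocks)
print(text)
```

The point is the division of labor: the existing parser handles the tokenizing, and the subclass only adds the part that reads the style information before it gets discarded. Grouping by exact top value is crude; real output probably needs some tolerance for blocks that sit a pixel or two apart.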