in reply to Preserving layout in pdf to text or html to text conversion

My employer currently provides a service that does that, and, as you were probably expecting to hear, it's a complicated task.

Essentially you're trying to do what I call 'reverse desktop publishing' -- extracting the relevant parts of the page and assigning each one a role such as 'body text', 'pull quote', 'caption' or 'chapter heading'. (Yes, I worked on a desktop publishing product called The Office Publish from 1987 to 1990, and I've also done some newspaper production.)

It really depends on how much of the page you need to recover. You can probably get most of it, but each additional piece is harder to extract than the last, and how hard depends on how complicated the layout is.

In any case, I don't work on the PDF stuff -- I do web development -- but I expect the approach you probably want is to take the HTML output and build up a map of the page, then go through the map and find all of the nearest neighbours. From there, it's a pattern recognition task, I'd imagine.
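
To make the 'map plus nearest neighbours' idea a bit more concrete, here is a minimal sketch in Python. It is only an illustration, not how our service does it: the Block record, its fields, and the nearest_neighbours function are all hypothetical names, and it assumes something upstream has already turned the HTML into text fragments with page coordinates. Distance is measured between block centres, which is just one possible choice.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Block:
    """Hypothetical record for one positioned text fragment on the page."""
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height

    @property
    def centre(self):
        return (self.x + self.w / 2, self.y + self.h / 2)


def nearest_neighbours(blocks):
    """Return {index: index of the closest other block}, by centre distance."""
    result = {}
    for i, block in enumerate(blocks):
        bx, by = block.centre
        best_index = None
        best_distance = None
        for j, other in enumerate(blocks):
            if i == j:
                continue  # a block is not its own neighbour
            ox, oy = other.centre
            distance = hypot(ox - bx, oy - by)
            if best_distance is None or distance < best_distance:
                best_index, best_distance = j, distance
        if best_index is not None:
            result[i] = best_index
    return result


# The "map of the page": made-up coordinates, purely for illustration.
page = [
    Block("Chapter heading", 50, 20, 400, 30),
    Block("Body text ...", 50, 70, 250, 300),
    Block("Caption", 320, 330, 130, 20),
    Block("Pull quote", 320, 120, 130, 80),
]

for i, j in nearest_neighbours(page).items():
    print(f"{page[i].text!r} is closest to {page[j].text!r}")
```

The neighbour pairs would then be the raw material for the pattern recognition step, for example guessing that a short block sitting right under an image is a caption. For a real page with many fragments, a spatial index would likely be a better fit than this all-pairs loop.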

Alex / talexb / Toronto

"Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds
