My employer currently provides a service that does that, and, as you were probably expecting to hear, it's a complicated task.

Essentially you're trying to do what I call 'reverse desktop publishing' -- extracting the relevant parts of the page and assigning each one a role such as 'body text', 'pull quote', 'caption' or 'chapter heading'. (Yes, I worked on a desktop publishing product called The Office Publish from 1987 to 1990, and I've also done some newspaper production.)

It really depends on how much of the page you want to recover. You can probably get most of it fairly easily, but each additional bit gets harder to extract, and the more complicated the layout, the sooner you hit that wall.

In any case, I don't work on the PDF stuff -- I do web development -- but I expect the approach you want is to take the HTML output and build up a map of where each piece of text sits on the page, then go through the map and find each piece's nearest neighbours. From there, I'd imagine it's a pattern recognition task.
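
To make the 'map and nearest neighbours' idea a bit more concrete, here is a minimal sketch in Python. It is not how our service does it (I don't work on that code), just one way the idea might look. Everything in it is an assumption for illustration: the `Fragment` class, the `gap`, `nearest_neighbours` and `group_blocks` helpers, and the premise that you have already pulled each piece of text and its bounding box (left, top, width, height) out of the HTML output.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Fragment:
    """Hypothetical record: one piece of text plus its box on the page."""
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def gap(a, b):
    """Distance between the edges of two boxes (0 if they touch or overlap)."""
    dx = max(a.x - (b.x + b.w), b.x - (a.x + a.w), 0)
    dy = max(a.y - (b.y + b.h), b.y - (a.y + a.h), 0)
    return hypot(dx, dy)


def nearest_neighbours(frags):
    """Map each fragment's index to the index of its closest other fragment."""
    result = {}
    for i, a in enumerate(frags):
        others = [(gap(a, b), j) for j, b in enumerate(frags) if j != i]
        if others:
            result[i] = min(others)[1]
    return result


def group_blocks(frags, max_gap):
    """Merge fragments that sit within max_gap of each other into blocks."""
    parent = list(range(len(frags)))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(frags)):
        for j in range(i + 1, len(frags)):
            if gap(frags[i], frags[j]) <= max_gap:
                parent[find(j)] = find(i)

    blocks = {}
    for i in range(len(frags)):
        blocks.setdefault(find(i), []).append(i)
    # Reading order inside a block: top to bottom, then left to right.
    ordered = [sorted(b, key=lambda k: (frags[k].y, frags[k].x))
               for b in blocks.values()]
    # Blocks themselves are ordered by their first fragment's position.
    return sorted(ordered, key=lambda b: (frags[b[0]].y, frags[b[0]].x))
```

The blocks that come out of `group_blocks` are what you would then feed to the pattern recognition step, for example guessing that a short block in a large font above a long block is a heading. The `max_gap` threshold is a guess you would have to tune per document; a quadratic all-pairs comparison like this is fine for a single page but would want a spatial index for anything bigger.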

Alex / talexb / Toronto

"Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds


In reply to Re: Preserving layout in pdf to text or html to text conversion by talexb
in thread Preserving layout in pdf to text or html to text conversion by tmoleary
