tmoleary has asked for the wisdom of the Perl Monks concerning the following question:

I am taking a linguistics course in which we are working on projects to crawl the web looking for linguistics papers, extract from them blocks of text that look like this:
(1) Emine elma-yi   ye-di.
    Emine apple-ACC eat-PAST.3sg
    `Emine ate the apple.'
and analyze them for the linguistic information that they contain. It is important to keep the text in three lines and, as much as possible, preserve the whitespace between words on each line.
Most of the linguistics papers that we find are PDF files, and the method we have been using so far is just to run a PDF to text converter that has a -layout option, and then extract the blocks of text that we are interested in from the converted papers. This works ok except that the PDF to text converter sometimes gets confused by accented Latin-1 characters, non-Latin-1 characters, and probably by the internal structure of some PDF files. In those cases it can convert some lines from the PDF file to anything from slightly corrupted text to garbage.
Recently I found a PDF to HTML converter that does a much better job of preserving the layout of the PDF files and avoiding text corruption, but of course its output is HTML. The HTML specifies locations of text on a page with DIV tags that look like this:
<DIV style="position:absolute;top:217;left:216">
I was thinking of using a Perl HTML parser to write a little application to convert the HTML to plain text while preserving the layout of the blocks of text as much as possible, but if something like that already exists, I'd like to just use it instead.

I have tried out several HTML to text converters, and none of the ones I tried pay attention to information like the style attribute of the DIV tag above. I realize that it is not possible to put a block of plain text exactly 217 units down and 216 units to the right of the edge of a page, but it would probably be sufficient to use that information to figure out which text should be above, below, or on the same line as other text, and, to some extent, by how much. If you know of an HTML to text converter that takes this kind of information into account, could you please let me know? Alternatively, if you know of a PDF to text converter that does a good job of preserving layout and handles non-ASCII text well, could you please let me know about that too? Thanks.
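For concreteness, here is the rough shape of what I had in mind, using HTML::TreeBuilder from CPAN. This is only a sketch: the regex assumes the top/left order shown in the DIV above, and the tolerance and character-width constants are guesses that would need tuning for whatever converter produced the HTML.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TreeBuilder;    # from the HTML-Tree distribution on CPAN

    my $Y_TOLERANCE = 5;    # guess: DIVs within 5 units vertically share a line
    my $X_SCALE     = 7;    # guess: ~7 horizontal units per character cell

    my $tree = HTML::TreeBuilder->new_from_file($ARGV[0]);

    # Collect every positioned DIV along with its coordinates.
    my @chunks;
    for my $div ($tree->look_down(_tag => 'div')) {
        my $style = $div->attr('style') or next;
        next unless $style =~ /top:\s*(\d+).*?left:\s*(\d+)/;
        push @chunks, { top => $1, left => $2, text => $div->as_text };
    }

    # Sort top-to-bottom, then left-to-right.
    @chunks = sort { $a->{top} <=> $b->{top} || $a->{left} <=> $b->{left} }
                   @chunks;

    # Chunks whose tops are close together go on one output line, each
    # padded out to an approximate column derived from its left offset.
    my $prev_top = -1e6;
    my $line     = '';
    for my $c (@chunks) {
        if ($c->{top} - $prev_top > $Y_TOLERANCE) {
            print "$line\n" if length $line;
            $line = '';
        }
        my $col = int($c->{left} / $X_SCALE);
        $line .= ' ' x ($col - length $line) if $col > length $line;
        $line .= $c->{text};
        $prev_top = $c->{top};
    }
    print "$line\n" if length $line;
    $tree->delete;    # free the parse tree

That only orders the text and approximates columns; getting the spacing close enough for the gloss alignment would still take per-converter tuning.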

Re: Preserving layout in pdf to text or html to text conversion
by talexb (Chancellor) on Apr 10, 2007 at 17:40 UTC

    My employer currently provides a service that does that, and, as you were probably expecting to hear, it's a complicated task.

    Essentially you're trying to do what I call 'reverse desktop publishing' -- extracting the relevant parts of the page and assigning them to 'body text' or 'pull quote' or 'caption' or 'chapter heading'. (Yes, I worked on a desktop publishing product called The Office Publish from 1987 to 1990, and I've also done some newspaper production.)

    It really depends on how much of the page you want to recover. You can probably get most of it without much trouble, but each additional piece gets harder to extract, depending on how complicated the layout is.

    In any case, I don't work on the PDF stuff -- I do web development -- but the approach you probably want is to take the HTML output and build up a map of the page, then go through the map and find all of the nearest neighbours. From there, it's a pattern recognition task, I'd imagine.
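    As a toy version of that map, assuming the positioned chunks have already been pulled out of the HTML into hashes with top/left/text keys, and guessing at a row height:

        my $ROW_HEIGHT = 12;    # guess at one text row in the converter's units
        my %rows;
        for my $c (@chunks) {
            # Bin each chunk by a rounded vertical coordinate, so nearest
            # neighbours on the same visual line land in the same bucket.
            push @{ $rows{ int( $c->{top} / $ROW_HEIGHT ) } }, $c;
        }
        for my $row ( sort { $a <=> $b } keys %rows ) {
            my @line = sort { $a->{left} <=> $b->{left} } @{ $rows{$row} };
            print join( '  ', map { $_->{text} } @line ), "\n";
        }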

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

Re: Preserving layout in pdf to text or html to text conversion
by samtregar (Abbot) on Apr 10, 2007 at 18:15 UTC
    You might look at text-mode browsers to see if any of them support CSS positioning. Possible candidates: lynx, links, Emacs w3m-mode. My guess is that none of them support it, but it's worth a look. If you find one that does, it's probably pretty easy to get a text dump from it.

    -sam

Re: Preserving layout in pdf to text or html to text conversion
by amaguk (Sexton) on Apr 10, 2007 at 21:19 UTC
    For dumping text from PDF files, I use pdftotext (from xpdf). When I want to preserve the layout, I use the -layout option. The -raw option can be useful too. So maybe this tool could help you.
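    For what it's worth, something along these lines is one way to call it from Perl; the -enc switch (which xpdf's pdftotext supports) may help with the accented characters you mentioned:

        # Assumes pdftotext is on the PATH.
        my ( $pdf, $txt ) = @ARGV;
        system( 'pdftotext', '-layout', '-enc', 'UTF-8', $pdf, $txt ) == 0
            or die "pdftotext failed on $pdf: $?\n";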
Re: Preserving layout in pdf to text or html to text conversion
by Sixtease (Friar) on Apr 10, 2007 at 20:07 UTC
    It just occurred to me... Do you really need to reconstruct the whole page in text mode? If I understand you correctly, you only need to check whether some lines sit one above another. You could check for that relation between lines using the CSS data with much less effort than re-creating the whole document.
Re: Preserving layout in pdf to text or html to text conversion
by cbrandtbuffalo (Deacon) on Apr 10, 2007 at 20:11 UTC
    Maybe this is obvious, but if you're going to try to add some of this functionality yourself, consider subclassing or otherwise building on one of the existing parser modules you mentioned. If you can use the existing module to do most of the work, you could focus on processing the DIV tags and the information in them. You need to find a parser module that keeps the CSS info rather than immediately throwing it out.
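    For instance, HTML::Parser's event API lets you hold on to the style attribute as it goes by rather than losing it. A minimal sketch (the regex again assumes the top/left order from the question):

        use strict;
        use warnings;
        use HTML::Parser;

        my @positioned;
        my ( $top, $left );    # coordinates of the DIV we are inside

        my $p = HTML::Parser->new(
            api_version => 3,
            start_h     => [ \&start_tag, 'tagname, attr' ],
            end_h       => [ \&end_tag,   'tagname' ],
            text_h      => [ \&text,      'dtext' ],
        );

        # Remember the coordinates from each positioned DIV's style attribute.
        sub start_tag {
            my ( $tag, $attr ) = @_;
            return unless $tag eq 'div' and defined $attr->{style};
            ( $top, $left ) = $attr->{style} =~ /top:\s*(\d+).*?left:\s*(\d+)/;
        }

        sub end_tag { undef $top if $_[0] eq 'div' }

        # Attach each non-blank text chunk to the enclosing DIV's position.
        sub text {
            my ($text) = @_;
            return unless defined $top and $text =~ /\S/;
            push @positioned, { top => $top, left => $left, text => $text };
        }

        $p->parse_file( $ARGV[0] );
        # @positioned is now the raw material for a layout reconstructor.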
Re: Preserving layout in pdf to text or html to text conversion
by Sixtease (Friar) on Apr 10, 2007 at 17:26 UTC
    I don't know of one, and I think writing one would be very beneficial. Let me know if you manage.