I am taking a linguistics course in which we are working on projects to crawl the web looking for linguistics papers, extract from them blocks of text that look like this:
(1) Emine elma-yi ye-di.
Emine apple-ACC eat-PAST.3sg
`Emine ate the apple.'
and analyze them for the linguistic information that they contain. It is important to keep the text in three lines and, as much as possible, preserve the whitespace between words on each line.
Most of the linguistics papers that we find are PDF files, and the method we have been using so far is just to run a PDF to text converter that has a -layout option, and then extract the blocks of text that we are interested in from the converted papers. This works OK, except that the converter is sometimes confused by accented Latin-1 characters, by non-Latin-1 characters, and probably by the internal structure of some PDF files. In those cases the affected lines come out as anything from slightly corrupted text to garbage.
Recently I found a PDF to HTML converter that does a much better job of preserving the layout of the PDF files and avoiding text corruption, but of course its output is HTML. The HTML specifies locations of text on a page with DIV tags that look like this:
<DIV style="position:absolute;top:217;left:216">
I was thinking of using a Perl HTML parser to write a little application that converts the HTML to plain text while preserving the layout of the blocks of text as much as possible, but if something like that already exists, I'd like to just use it instead. I have tried out several HTML to text converters, and none of them pays attention to information like the style attribute of the DIV tag above. I realize that it is not possible to put a block of plain text exactly 217 units down and 216 units to the right of the edge of a page, but it would probably be sufficient to use that information to figure out which text should be above, below, or on the same line as other text, and, to some extent, by how much.
If you know of an HTML to text converter that takes this kind of information into account, could you please let me know? Alternatively, if you know of a PDF to text converter that does a good job of preserving layout and handles non-ASCII text well, could you please let me know about that too? Thanks.
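To make the idea concrete, here is a rough sketch of the kind of conversion I mean. It is written in Python with only the standard library (a Perl version on top of an HTML parser would follow the same steps), and it rests on assumptions that are mine, not the converter's: that every piece of text sits inside a DIV whose style carries plain integer top and left values, that about 6 units correspond to one character cell (CHAR_WIDTH), and that DIVs whose top values differ by at most 3 units (LINE_TOLERANCE) belong on the same line. The names PositionedText and layout are made up for this illustration.

```python
import re
from html.parser import HTMLParser

# Assumed scale factors (guesses, not taken from any converter's docs).
CHAR_WIDTH = 6       # HTML units per character cell
LINE_TOLERANCE = 3   # max difference in "top" to count as the same line

# Matches "top:217" or "left:216" at the start of a style or after a ";".
STYLE_RE = re.compile(r"(?:^|;)\s*(top|left)\s*:\s*(-?\d+)")


class PositionedText(HTMLParser):
    """Collect [top, left, text] entries from absolutely positioned DIVs."""

    def __init__(self):
        super().__init__()
        self.stack = []    # one entry per open DIV: chunk index or None
        self.chunks = []   # [top, left, text] for each positioned DIV

    def handle_starttag(self, tag, attrs):
        if tag != "div":   # the parser lowercases tag names for us
            return
        style = (dict(attrs).get("style") or "").lower()
        pos = dict(STYLE_RE.findall(style))
        if "top" in pos and "left" in pos:
            self.chunks.append([int(pos["top"]), int(pos["left"]), ""])
            self.stack.append(len(self.chunks) - 1)
        else:
            self.stack.append(None)

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Attach text to the innermost positioned DIV, if there is one.
        for idx in reversed(self.stack):
            if idx is not None:
                self.chunks[idx][2] += data
                break


def layout(html):
    """Return plain text with each chunk placed at column left // CHAR_WIDTH."""
    parser = PositionedText()
    parser.feed(html)
    parser.close()
    chunks = sorted((c for c in parser.chunks if c[2].strip()),
                    key=lambda c: (c[0], c[1]))

    rows = []  # each row: (top of first chunk, [(left, text), ...])
    for top, left, text in chunks:
        text = text.replace("\xa0", " ").replace("\n", " ").strip()
        if rows and top - rows[-1][0] <= LINE_TOLERANCE:
            rows[-1][1].append((left, text))
        else:
            rows.append((top, [(left, text)]))

    out = []
    for _, items in rows:
        line = ""
        for left, text in sorted(items):
            col = left // CHAR_WIDTH
            if line and len(line) >= col:
                line += " "            # never overwrite earlier text
            else:
                line = line.ljust(col)  # pad out to the target column
            line += text
        out.append(line)
    return "\n".join(out)


SAMPLE = """
<DIV style="position:absolute;top:217;left:216">Emine</DIV>
<DIV style="position:absolute;top:217;left:264">elma-yi</DIV>
<DIV style="position:absolute;top:217;left:330">ye-di.</DIV>
<DIV style="position:absolute;top:230;left:216">Emine</DIV>
<DIV style="position:absolute;top:230;left:264">apple-ACC</DIV>
<DIV style="position:absolute;top:230;left:330">eat-PAST.3sg</DIV>
<DIV style="position:absolute;top:243;left:216">`Emine ate the apple.'</DIV>
"""
```

Calling layout(SAMPLE) yields three lines in which the words of the first two lines start in the same columns, which is the property that matters for the glossed examples. The sketch ignores how far apart the rows are vertically (it emits no blank lines) and knows nothing about fonts, so the two constants would need tuning against real converter output.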