tmoleary has asked for the wisdom of the Perl Monks concerning the following question:
and analyze them for the linguistic information that they contain. It is important to keep the text in three lines and, as much as possible, preserve the whitespace between words on each line.

(1) Emine elma-yi   ye-di.
    Emine apple-ACC eat-PAST.3sg
    `Emine ate the apple.'
<DIV style="position:absolute;top:217;left:216">

I was thinking of using a Perl HTML parser to write a little application to convert the HTML to plain text while preserving the layout of the blocks of text as much as possible, but if something like that already exists, I'd like to just use it instead.

I have tried out several HTML-to-text converters, and none of the ones I tried pays attention to information like the style attribute of the DIV tag above. I realize that it is not possible to put a block of plain text exactly 217 units down and 216 units to the right of the edge of a page, but it would probably be sufficient to use that information to figure out which text should be above, below or on the same line as other text, and, to some extent, by how much.

If you know of an HTML-to-text converter that takes this kind of information into account, could you please let me know? Alternatively, if you know of a PDF-to-text converter that does a good job of preserving layout and handles non-ASCII text well, could you please let me know about that too? Thanks.
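To make the idea concrete, here is a minimal sketch of using the `top` and `left` values to decide which text shares a line and roughly how far apart it sits. It is written in Python with only the standard library; the same approach should carry over to a Perl HTML parser. Everything named here (`PositionedText`, `layout`, `UNITS_PER_COL`, `ROW_TOLERANCE`) is hypothetical, the scale factors are guesses, and the sample coordinates other than `top:217;left:216` are invented for illustration. It also assumes the positioned DIVs are not nested.

```python
import re
from html.parser import HTMLParser

UNITS_PER_COL = 6   # assumed: layout units per character column
ROW_TOLERANCE = 5   # assumed: blocks whose "top" differs by <= this share a line


class PositionedText(HTMLParser):
    """Collect (top, left, text) for each absolutely positioned DIV."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._pos = None   # (top, left) of the DIV currently being read
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":   # HTMLParser lowercases tag names
            return
        style = dict(attrs).get("style") or ""
        top = re.search(r"top\s*:\s*(\d+)", style)
        left = re.search(r"left\s*:\s*(\d+)", style)
        if top and left:
            self._pos = (int(top.group(1)), int(left.group(1)))
            self._buf = []

    def handle_data(self, data):
        if self._pos is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self._pos is not None:
            text = "".join(self._buf).strip()
            if text:
                self.blocks.append((*self._pos, text))
            self._pos = None


def layout(html):
    """Render positioned DIVs as plain text, one output line per row."""
    parser = PositionedText()
    parser.feed(html)
    parser.close()
    if not parser.blocks:
        return ""
    min_left = min(left for _, left, _ in parser.blocks)

    # Group blocks into rows: sorted by top, a new row starts when the
    # gap to the row's first block exceeds the tolerance.
    rows = []
    for top, left, text in sorted(parser.blocks):
        if rows and top - rows[-1][0] <= ROW_TOLERANCE:
            rows[-1][1].append((left, text))
        else:
            rows.append((top, [(left, text)]))

    lines = []
    for _, items in rows:
        line = ""
        for left, text in sorted(items):
            col = (left - min_left) // UNITS_PER_COL
            if line:
                # Pad to the target column, keeping at least one space.
                line = line.ljust(max(col, len(line) + 1))
            else:
                line = " " * col
            line += text
        lines.append(line)
    return "\n".join(lines)


SAMPLE = """
<DIV style="position:absolute;top:217;left:216">Emine</DIV>
<DIV style="position:absolute;top:217;left:276">elma-yi</DIV>
<DIV style="position:absolute;top:217;left:360">ye-di.</DIV>
<DIV style="position:absolute;top:231;left:216">Emine</DIV>
<DIV style="position:absolute;top:231;left:276">apple-ACC</DIV>
<DIV style="position:absolute;top:231;left:360">eat-PAST.3sg</DIV>
"""

if __name__ == "__main__":
    print(layout(SAMPLE))
```

This sketch only handles horizontal placement and row grouping. It does not insert blank lines for large vertical gaps, and a real converter's output may wrap each line or word differently, so the grouping step would need adjusting to match.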
Replies are listed 'Best First'.

Re: Preserving layout in pdf to text or html to text conversion
by talexb (Chancellor) on Apr 10, 2007 at 17:40 UTC

Re: Preserving layout in pdf to text or html to text conversion
by samtregar (Abbot) on Apr 10, 2007 at 18:15 UTC

Re: Preserving layout in pdf to text or html to text conversion
by amaguk (Sexton) on Apr 10, 2007 at 21:19 UTC

Re: Preserving layout in pdf to text or html to text conversion
by Sixtease (Friar) on Apr 10, 2007 at 20:07 UTC

Re: Preserving layout in pdf to text or html to text conversion
by cbrandtbuffalo (Deacon) on Apr 10, 2007 at 20:11 UTC

Re: Preserving layout in pdf to text or html to text conversion
by Sixtease (Friar) on Apr 10, 2007 at 17:26 UTC