Re: PDF Parser

in reply to PDF Parser

My best bet is to run pdftohtml -xml and parse the resulting XML.
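
For what it's worth, here is a minimal sketch of that approach, not a tested production tool. It assumes the poppler build of pdftohtml (the -i and -stdout switches) and that the XML consists of <page number="..."> elements holding <text top="..." left="..."> fragments. The names parse_fragments and lines_by_position are made up for illustration. To stay core-only it scans the XML with a regex; a real parser such as XML::LibXML would be the more robust choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Extract positioned text fragments from `pdftohtml -xml` output.
# ASSUMPTION: regex scan for brevity; numeric entities are left undecoded.
sub parse_fragments {
    my ($xml) = @_;
    my %entity = ( lt => '<', gt => '>', quot => '"', apos => "'", amp => '&' );
    my ( @fragments, $page );
    $page = 0;

    # Visit <page ...> and <text ...>...</text> elements in document order.
    while ( $xml =~ m{<page\b([^>]*)>|<text\b([^>]*)>(.*?)</text>}gs ) {
        my ( $page_attrs, $text_attrs, $body ) = ( $1, $2, $3 );

        if ( defined $page_attrs ) {
            $page = $1 if $page_attrs =~ /\bnumber="(\d+)"/;
            next;
        }

        my %attr = $text_attrs =~ /(\w+)="([^"]*)"/g;
        $body =~ s/<[^>]+>//g;                                # drop inline <b>, <i>, <a>
        $body =~ s/&(lt|gt|quot|apos|amp);/$entity{$1}/g;     # predefined XML entities

        push @fragments, {
            page => $page,
            top  => $attr{top}  // 0,
            left => $attr{left} // 0,
            text => $body,
        };
    }
    return @fragments;
}

# Rebuild lines: fragments with the same page and top coordinate form
# one line, ordered left to right (no right-to-left handling here).
sub lines_by_position {
    my (@fragments) = @_;
    my %line;
    push @{ $line{ $_->{page} }{ $_->{top} } }, $_ for @fragments;

    my @out;
    for my $page ( sort { $a <=> $b } keys %line ) {
        for my $top ( sort { $a <=> $b } keys %{ $line{$page} } ) {
            my @sorted = sort { $a->{left} <=> $b->{left} } @{ $line{$page}{$top} };
            push @out, join ' ', map { $_->{text} } @sorted;
        }
    }
    return @out;
}

# Only runs when called as a script with a PDF path as argument.
if ( !caller && @ARGV ) {
    my $pdf = shift @ARGV;
    open my $fh, '-|', 'pdftohtml', '-xml', '-i', '-stdout', $pdf
        or die "Cannot run pdftohtml: $!";
    binmode $fh, ':encoding(UTF-8)';
    my $xml = do { local $/; <$fh> };
    close $fh or die "pdftohtml exited with an error\n";

    binmode STDOUT, ':encoding(UTF-8)';
    print "$_\n" for lines_by_position( parse_fragments($xml) );
}
```

Call it as perl pdfxml.pl book.pdf (the script name is hypothetical). Grouping on identical top values is crude: fragments that sit a pixel or two higher or lower, such as separately positioned accents, would need some tolerance when matching lines.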

See also Parsing PDFs by text position?

Cheers Rolf

( addicted to the Perl Programming Language)


Replies are listed 'Best First'.
Re^2: PDF Parser by wollmers (Scribe) on Mar 18, 2014 at 12:35 UTC
Thx for the tip. Maybe I can solve one of my open problems this way: reconstructing the text of a book in Yiddish (accented Hebrew), where the accents are placed by position. With pdftotext the accents end up at the end of the line.
Re^3: PDF Parser by LanX (Saint) on Mar 18, 2014 at 13:20 UTC
Well, while learning to read Yiddish is on my to-do list, I never thought about doing it via PDF ;) The C sources of pdftohtml are pretty compact calls to something like ghostscript (IIRC)¹, so porting it to Perl in order to have tighter control shouldn't be a problem. HTH :) Cheers Rolf ( addicted to the Perl Programming Language) Update: ¹) nope, it's XPDF! :)
