in reply to PDF Parser

My best bet would be to run pdftohtml -xml and parse the resulting XML.

See also Parsing PDFs by text position?
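
A rough sketch of what I mean, in case it helps. It assumes the XML looks like the usual pdftohtml -xml output, i.e. <text> elements carrying top/left/width/height attributes inside <page>; the sample document, the helper names parse_fragments and fragments_to_lines, and the 3-pixel tolerance are all made up for illustration. The regex extraction is only there to keep the example free of CPAN dependencies; for real files a proper XML parser such as XML::LibXML is the safer choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Abridged, made-up sample of what `pdftohtml -xml` writes.
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="140" left="200" width="60" height="16" font="0">world</text>
<text top="100" left="90" width="80" height="16" font="0">First line</text>
<text top="141" left="90" width="50" height="16" font="0">Hello</text>
</page>
</pdf2xml>
XML

# Collect every <text> element together with its position attributes.
# Entities such as &amp; are left undecoded in this sketch.
sub parse_fragments {
    my ($doc) = @_;
    my @fragments;
    while ( $doc =~ m{<text\s+([^>]*)>(.*?)</text>}gs ) {
        my ( $attrs, $content ) = ( $1, $2 );
        my %attr = $attrs =~ /(\w+)="([^"]*)"/g;
        $content =~ s/<[^>]+>//g;    # drop inline markup such as <b> or <i>
        $attr{text} = $content;
        push @fragments, \%attr;
    }
    return @fragments;
}

# Rebuild reading order: fragments whose "top" differs from the first
# fragment of the current line by at most $tolerance pixels belong to that
# line; within a line, fragments are ordered left to right.
# (For right-to-left scripts the inner sort would have to be reversed.)
sub fragments_to_lines {
    my ( $tolerance, @fragments ) = @_;
    my @groups;
    for my $f ( sort { $a->{top} <=> $b->{top} } @fragments ) {
        if ( @groups && abs( $groups[-1][0]{top} - $f->{top} ) <= $tolerance ) {
            push @{ $groups[-1] }, $f;
        }
        else {
            push @groups, [$f];
        }
    }
    my @lines;
    for my $group (@groups) {
        my @ordered = sort { $a->{left} <=> $b->{left} } @$group;
        push @lines, join( ' ', map { $_->{text} } @ordered );
    }
    return @lines;
}

my @lines = fragments_to_lines( 3, parse_fragments($xml) );
print "$_\n" for @lines;
```

With real input you would slurp the file written by pdftohtml -xml into $xml instead of the heredoc, and probably handle each <page> separately so that lines from different pages don't get mixed.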

Cheers Rolf

(addicted to the Perl Programming Language)

Re^2: PDF Parser
by wollmers (Scribe) on Mar 18, 2014 at 12:35 UTC

    Thx for the tip.

    Maybe I can solve one of my open problems this way: reconstruct the text of a book in Yiddish (accented Hebrew), where the accents are added by position. With pdftotext the accents appear at the end of the line.

      Well, while learning to read Yiddish is on my to-do list, I never thought about doing it via PDF ;)

      The C sources of pdftohtml are pretty compact and mostly consist of calls to something like Ghostscript (IIRC)¹, so porting them to Perl in order to have tighter control shouldn't be a problem.

      HTH :)

      Cheers Rolf

      (addicted to the Perl Programming Language)

      Update

      ¹) Nope, it's Xpdf, not Ghostscript! :)