Re^2: Build a PDF book index

in reply to Re: Build a PDF book index
in thread Build a PDF book index

Last time I tried it, pdftohtml -xml was more accurate.

Update

See also Parsing PDFs by text position?

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Wikisyntax for the Monastery


Replies are listed 'Best First'.
Re^3: Build a PDF book index by markong (Pilgrim) on Mar 17, 2018 at 16:09 UTC
Thank you, this tool extracted the text contents successfully, with apostrophes and (prolonged) dashes encoded as Latin-1!
Re^4: Build a PDF book index by LanX (Saint) on Mar 17, 2018 at 16:51 UTC
You're welcome! Please note: the `-xml` switch also gives you the font number and text position, in case you need to adjust characters as described. I had to do this in the past. Cheers Rolf
Re^5: Build a PDF book index by markong (Pilgrim) on Mar 18, 2018 at 23:23 UTC
I am curious what you are referring to. Do you mean the content of the XML output file? E.g.: `<text top="78" left="108" width="540" height="21" font="3">The developer, on the other hand, feels like he’s interrupted several times a day for</text>` Or are you talking about some CLI option to pass to the command? In that case, I don't see anything related (pdftohtml version 0.24.3).
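For anyone wanting to use those attributes, here is a minimal sketch (in Python, not Perl) of reading the font number and position from `<text>` elements like the one quoted above. It assumes the XML wraps `<text>` elements in `<page number="...">` elements under a single root. The `SAMPLE` document, the second `<text>` line, the page size values and the helper name `text_runs` are made up for illustration; only the first `<text>` element comes from this thread.

```python
import xml.etree.ElementTree as ET

# Illustrative input only: the first <text> element is the one quoted in the
# thread; the wrapper elements, page size and second line are assumptions.
SAMPLE = """<pdf2xml>
<page number="1" height="1188" width="918">
<text top="78" left="108" width="540" height="21" font="3">The developer, on the other hand, feels like he’s interrupted several times a day for</text>
<text top="40" left="108" width="200" height="30" font="1">Chapter <b>One</b></text>
</page>
</pdf2xml>"""


def text_runs(xml_string):
    """Return one dict per <text> element, sorted into reading order."""
    root = ET.fromstring(xml_string)
    runs = []
    for page in root.iter("page"):
        page_no = int(page.get("number"))
        for el in page.iter("text"):
            runs.append({
                "page": page_no,
                "top": int(el.get("top")),
                "left": int(el.get("left")),
                "font": int(el.get("font")),
                # itertext() also picks up text nested in <b>/<i> children
                "text": "".join(el.itertext()),
            })
    # Sort by page, then top-to-bottom, then left-to-right
    runs.sort(key=lambda r: (r["page"], r["top"], r["left"]))
    return runs


if __name__ == "__main__":
    for run in text_runs(SAMPLE):
        print(run["page"], run["top"], run["left"], run["font"], run["text"])
```

With the font number available per run, one could for example treat a particular font id as "heading" or apply character fixes only to runs in a given font. Which font id means what depends on the PDF at hand.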
Re^6: Build a PDF book index by LanX (Saint) on Mar 18, 2018 at 23:56 UTC
