in reply to Build a PDF book index

Tl;dr, but

>  I've noticed that some characters aren't as expected when extracted: 

PDF allows a document to embed its own fonts, and such a font can come with its own custom encoding, so the character codes you extract may look random.

You can only fix this for one specific PDF document at a time, by scanning for the affected font number and manually building a translation table in a hash.
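
A minimal sketch of such a translation table, just to illustrate the idea. The hash name `%font_map`, the sub `fix_text` and the two mappings are invented for this example; the real entries have to be worked out by comparing the extracted text with what is actually printed on the page.

```perl
use strict;
use warnings;

# Hypothetical table for ONE font in ONE document.
# Keys: the characters the extractor hands back for that font.
# Values: what is really shown on the page.
# Both entries are made up for illustration.
my %font_map = (
    "\x{01}" => 'fi',
    "\x{02}" => 'fl',
);

# Translate a string that was extracted from text set in that font.
# Characters without an entry in the table are passed through unchanged.
sub fix_text {
    my ($text, $map) = @_;
    return join '',
        map { exists $map->{$_} ? $map->{$_} : $_ }
        split //, $text;
}
```

With this table, `fix_text("of\x{01}ce", \%font_map)` returns `office`. You would need one such hash per affected font, and it is only valid for that one document.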

HTH! :)

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Wikisyntax for the Monastery