in reply to Extracting text from a PDF (using PDF::API2)
You seem to have acquired a pretty good grasp of the subject matter already, so I'm not sure I'm telling you anything new...
Anyhow, the problem you have is most likely related to font subsetting, because embedded fonts are typically also subsetted these days. I've tried to explain the issue in some more detail in Re: CAM::PDF did't extract all pdf's content, so you might want to read that first. The next step would be to find out whether you actually have those reverse mapping tables I mentioned in that node. Note that they're optional; in other words, they're not required to render the fonts' glyphs (i.e. the text) properly. I've more than once come across PDFs that didn't have them, presumably on purpose, to make it harder to mechanically extract the content...
Unfortunately, if you don't have those mapping tables, you're pretty much out of luck, because you'd have to generate them yourself, character by character (or perhaps with the help of OCR, though I'm not aware of any ready-made tool for that purpose).
To find out, either look at the PDF source (the respective CMap streams are referenced from the font dictionaries via an entry named /ToUnicode), or check whether Acrobat Reader's text tool can in fact extract selected text (cut-n-paste the selection into an editor, or some such). If the tables simply aren't there, it won't be able to extract the text either.
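For a quick first check, something like the following sketch scans the raw PDF bytes for /ToUnicode references (a hypothetical helper, not a PDF::API2 call; caveat: in PDF 1.5+ the font dictionaries may sit inside compressed object streams, in which case a plain byte scan won't see the key even if it's there):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count occurrences of the /ToUnicode key in the raw PDF bytes.
# Caveat: font dictionaries stored inside compressed object streams
# (PDF 1.5 and later) won't show up in a naive scan like this.
sub count_tounicode {
    my ($file) = @_;
    open my $fh, '<:raw', $file or die "Can't open $file: $!";
    local $/;                        # slurp mode
    my $data = <$fh>;
    close $fh;
    my $n = () = $data =~ /\/ToUnicode\b/g;
    return $n;
}

if (@ARGV) {
    my $n = count_tounicode($ARGV[0]);
    print $n ? "$n /ToUnicode reference(s) found\n"
             : "no /ToUnicode references found\n";
}
```

Zero hits doesn't strictly prove the tables are missing (see the object-stream caveat), but one or more hits is a good sign.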
Once you've verified you have the tables, I'm afraid there's still work to do, i.e. writing code that reads those tables and maps the subsetted encoding back to Unicode. AFAIK, PDF::API2 doesn't have support for this (it's geared more towards generating PDFs than extracting content from existing ones), and I'm not aware of any other CPAN module that would help here either (though it's been a while since I last checked, so you might want to look at recent additions to CPAN).
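To give an idea of what that code would involve: the table behind /ToUnicode is a CMap stream, and the simplest case is its beginbfchar sections, which list <code> <codepoint> pairs in hex. Here's a minimal sketch for that case only (real CMaps also allow beginbfrange sections and multi-byte codes, and you'd first have to decompress the stream; the sample CMap fragment is made up for illustration):

```perl
use strict;
use warnings;

# Parse the beginbfchar sections of a (decompressed) ToUnicode CMap
# into a hash mapping character codes to Unicode characters.
# Simplification: handles only <src> <dst> pairs, not beginbfrange.
sub parse_bfchar {
    my ($cmap) = @_;
    my %map;
    while ($cmap =~ /beginbfchar(.*?)endbfchar/sg) {
        my $section = $1;
        while ($section =~ /<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>/g) {
            $map{hex $1} = chr hex $2;
        }
    }
    return \%map;
}

# Made-up CMap fragment, mapping subset codes 0x01/0x02 to 'H'/'e'
my $cmap = <<'END';
2 beginbfchar
<01> <0048>
<02> <0065>
endbfchar
END

my $map = parse_bfchar($cmap);
printf "code 0x01 => U+%04X (%s)\n", ord $map->{1}, $map->{1};
```

With such a map in hand, decoding the page's text strings is then a matter of looking up each character code from the content stream in %map.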