You seem to have acquired a pretty good grasp of the subject matter already, so I'm not sure I'm telling you anything new...

Anyhow, the problem you have is most likely related to font subsetting, because embedded fonts are typically also subsetted these days. I've tried to explain the issue in more detail in Re: CAM::PDF did't extract all pdf's content, so you might want to read that first. The next step is to find out whether your PDF actually has the reverse mapping tables I mentioned in that node. Note that they're optional: they're not required to render the fonts' glyphs (i.e. the text) properly, and I've more than once come across PDFs that didn't have them (presumably on purpose, to make it harder to mechanically extract the content).

Unfortunately, if you don't have those mapping tables, you're pretty much out of luck, because you'd have to build them by hand, character by character (or maybe with the help of OCR, though I'm not aware of any ready-made tool for that purpose).

To find out, either look at the PDF source (the respective objects should be referenced from the font dictionary via an entry named /ToUnicode), or check whether Acrobat Reader's text tool can in fact extract selected text (cut-n-paste the selection into an editor, or some such). If the tables are just not there, it won't be able to extract the text either.
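As a rough first check of the PDF source, something like the following Python sketch could count how many fonts there are and how many /ToUnicode references show up. The function name `count_tounicode` is made up for illustration. It only scans the raw bytes, so it assumes the relevant dictionaries are not tucked away inside compressed object streams (as newer PDFs often do); in that case you'd have to decompress the file first or use a real PDF parser.

```python
import re

def count_tounicode(pdf_bytes: bytes) -> dict:
    """Count font dictionaries and /ToUnicode references in raw PDF bytes.

    Assumption: the dictionaries are stored uncompressed, so a plain
    byte scan can see them. This is a heuristic, not a PDF parser.
    """
    # "/Type /Font" but not "/Type /FontDescriptor"
    fonts = len(re.findall(rb"/Type\s*/Font(?![A-Za-z])", pdf_bytes))
    # Every font with a reverse mapping table references it via /ToUnicode
    tounicode = len(re.findall(rb"/ToUnicode\b", pdf_bytes))
    return {"fonts": fonts, "tounicode": tounicode}
```

If `tounicode` comes out as 0 while `fonts` does not, that is a hint (not proof) that the tables are missing.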

Once you've verified you have the tables, I'm afraid there's still work to do, i.e. writing code that reads those tables and maps the subsetted encoding back to Unicode. AFAIK, PDF::API2 doesn't have support for this (it's more geared towards generating PDFs than extracting content from existing ones), and I'm not aware of any other CPAN module that would help here either (though it's been a while since I last checked, so you might want to re-check recent additions to CPAN).
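To give an idea of what that code involves, here is a minimal Python sketch that reads the `bfchar` and `bfrange` sections of an already decompressed /ToUnicode CMap and maps character codes back to Unicode. The names `parse_tounicode` and `decode` are hypothetical, and the sketch assumes well-formed hex strings of even length with UTF-16BE destination values; getting the CMap stream out of the PDF and splitting the content stream's strings into codes is not covered.

```python
import re

HEX = re.compile(r"<([0-9A-Fa-f]+)>")

def _utf16(hexstr):
    # Destination values in a ToUnicode CMap are UTF-16BE code units
    return bytes.fromhex(hexstr).decode("utf-16-be")

def parse_tounicode(cmap_text):
    """Build {character code: Unicode string} from CMap text."""
    mapping = {}
    # bfchar: pairs of <source code> <destination>
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        codes = HEX.findall(block)
        for src, dst in zip(codes[0::2], codes[1::2]):
            mapping[int(src, 16)] = _utf16(dst)
    # bfrange: <lo> <hi> <dst>  or  <lo> <hi> [<dst1> <dst2> ...]
    entry = re.compile(
        r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*"
        r"(?:\[(.*?)\]|<([0-9A-Fa-f]+)>)", re.S)
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, arr, dst in entry.findall(block):
            lo, hi = int(lo, 16), int(hi, 16)
            if dst:
                # Single start value: the last character is incremented
                base = _utf16(dst)
                for offset in range(hi - lo + 1):
                    mapping[lo + offset] = base[:-1] + chr(ord(base[-1]) + offset)
            else:
                # Array form: one destination per code in the range
                for offset, d in enumerate(HEX.findall(arr)):
                    mapping[lo + offset] = _utf16(d)
    return mapping

def decode(codes, mapping):
    """Map a sequence of character codes to text; unknown codes become U+FFFD."""
    return "".join(mapping.get(c, "\ufffd") for c in codes)
```

Note that the codes in the page content may be one or two bytes wide depending on the font, so the splitting step has to take the font type into account before `decode` is of any use.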


In reply to Re: Extracting text from a PDF (using PDF::API2) by almut
in thread Extracting text from a PDF (using PDF::API2) by music_man1352000
