music_man1352000 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks

I know this topic has come up many times before in various flavours. I am sorry to bring it up again, however I need some specific help...

I am writing a script that requires me to extract the text from a number of PDFs that contain data in tabular (table) form. I have tried a number of approaches (CAM::PDF, pdftotext, pdftohtml etc.) however for various reasons none of these options has proved to be particularly suitable for me. Based on my attempts with the aforementioned approaches I have decided that what I am trying to do requires that I write custom code that works with the PDF on a stream level.

After spending a lot of time searching forums, reading code and experimenting with my own code I have been able to use PDF::API2 to access and decompress the content stream in a PDF. The next step is obviously to parse the stream to extract the text. I understand what this requires (in terms of the algorithm), however (as so often happens) I've hit a problem when dealing with the implementation...
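In case it helps anyone following along, the decompression step I described can also be done without touching PDF::API2's internals, at least for the common case of a plain /FlateDecode stream with no predictor. A rough sketch (the stream/endstream regex is naive and will misfire on unusual files, so treat it as a debugging aid, not a parser):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Compress::Zlib;    # provides compress()/uncompress()

# Pull out every stream ... endstream body from raw PDF data and try to
# inflate it.  Assumes plain /FlateDecode with no predictor; streams
# that fail to inflate (e.g. images) are silently skipped.
sub inflate_streams {
    my ($raw) = @_;
    my @out;
    while ($raw =~ /stream\r?\n(.*?)\r?\nendstream/gs) {
        my $data = uncompress($1);
        push @out, $data if defined $data;
    }
    return @out;
}
```

Feeding it the slurped bytes of a PDF (`my @contents = inflate_streams($raw_pdf);`) gives you the decompressed content streams to parse.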

The actual text content (characters) is encoded using fonts that are embedded in the PDF. I can't figure out how to use PDF::API2 to read the embedded font definitions so that I can use them to decode the text content. Can you please provide, or point me to, an example that would help?

Thanks in advance!

Re: Extracting text from a PDF (using PDF::API2)
by almut (Canon) on Dec 03, 2009 at 06:12 UTC

    You seem to have acquired a pretty good grasp of the subject matter already, so I'm not sure I'm telling you anything new...

    Anyhow, the problem you have is most likely related to font subsetting, because embedded fonts are typically also subsetted these days.  I've tried to explain in some more detail what the issue is in Re: CAM::PDF did't extract all pdf's content, so you might want to read this first. The next thing would be to find out whether you actually have those reverse mapping tables I mentioned in that node.  Note that they're optional; in other words, they're not required to render the fonts' glyphs (i.e. the text) properly (and I've more than once come across PDFs that didn't have them — presumably on purpose, to make it harder to mechanically extract the content...).

    Unfortunately, if you don't have those mapping tables, you're pretty much out of luck, because you'd have to generate them manually yourself, character for character (or maybe with the help of OCR, though I'm not aware of any ready-made tool for that purpose).

    In order to find out, either look at the PDF source (the respective objects should be referenced from the font descriptor via a dictionary entry named /ToUnicode), or check if Acrobat Reader's text tool can in fact extract selected text (cut-n-paste the selection into an editor, or some such). In case the tables are just not there, it won't be able to extract the text either.
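    That first check can be automated crudely. A sketch, with the caveat that font dictionaries living inside compressed object streams (PDF 1.5+) won't be seen by a plain grep, so a hit is meaningful but zero hits isn't conclusive:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count /ToUnicode references in raw PDF data.
sub count_tounicode {
    my ($raw) = @_;
    my $n = () = $raw =~ m{/ToUnicode\b}g;    # count-of-matches idiom
    return $n;
}

if (my $file = shift @ARGV) {
    open my $fh, '<:raw', $file or die "$file: $!";
    my $raw = do { local $/; <$fh> };
    printf "%d /ToUnicode reference(s)\n", count_tounicode($raw);
}
```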

    Once you've verified you have the tables, I'm afraid there's still work to do, i.e. write code that reads those tables and maps the subsetted encoding back to unicode.  AFAIK, PDF::API2 doesn't have support for this (it's more geared towards generating PDFs, rather than extracting content from existing PDFs), and I'm not aware of any other CPAN module that would help here either (though it's a while since I've last checked, so you might want to re-check recent additions to CPAN).
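    For what it's worth, once you have the decompressed CMap stream, reading the tables is mostly regex work. A rough, hand-rolled sketch of a bfchar/bfrange parser -- real CMaps may also use array destinations in bfrange and surrogate pairs, which this ignores:

```perl
use strict;
use warnings;

# Minimal ToUnicode CMap parser: handles the common
# "N beginbfchar ... endbfchar" and "N beginbfrange ... endbfrange"
# sections.  Destinations are UTF-16BE hex quads.
sub parse_tounicode {
    my ($cmap) = @_;
    my %map;    # character code (integer) => Unicode string

    while ($cmap =~ /beginbfchar\s*(.*?)\s*endbfchar/gs) {
        my $body = $1;
        while ($body =~ /<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>/g) {
            # split the destination into 4-hex-digit (UTF-16BE) units
            $map{ hex $1 } = join '', map { chr hex } unpack '(A4)*', $2;
        }
    }
    while ($cmap =~ /beginbfrange\s*(.*?)\s*endbfrange/gs) {
        my $body = $1;
        while ($body =~ /<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>/g) {
            my ($lo, $hi, $dst) = (hex $1, hex $2, hex $3);
            $map{ $lo + $_ } = chr($dst + $_) for 0 .. $hi - $lo;
        }
    }
    return \%map;
}
```

    The keys of the returned hash are the character codes you see in the content stream; the values are the Unicode strings to substitute for them.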

Re: Extracting text from a PDF (using PDF::API2)
by desemondo (Hermit) on Dec 03, 2009 at 01:51 UTC
    I dunno if you've looked at it yet, but if you haven't, take a gander at page 291 onwards of the PDF spec v1.7 on Adobe's website.

    Also, page 238, 'Basics of Showing Text', has this to say...

    9.2.2 Basics of Showing Text

    EXAMPLE 1. This example illustrates the most straightforward use of a font. The text ABC is placed 10 inches from the bottom of the page and 4 inches from the left edge, using 12-point Helvetica.

        BT
        /F13 12 Tf
        288 720 Td
        (ABC) Tj
        ET

    The five lines of this example perform these steps:
    a) Begin a text object.
    b) Set the font and font size to use, installing them as parameters in the text state. In this case, the font resource identified by the name F13 specifies the font externally known as Helvetica.
    c) Specify a starting position on the page, setting parameters in the text object.
    d) Paint the glyphs for a string of characters at that position.
    e) End the text object.
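    A content stream like the one above can be pulled apart mechanically. A minimal tokenizer sketch, handling only names, numbers, simple literal strings and operators (real streams also contain hex strings, arrays, dictionaries, comments and escaped parentheses):

```perl
use strict;
use warnings;

# Split a content-stream fragment into [operand..., operator] groups.
sub tokenize_content {
    my ($stream) = @_;
    my (@ops, @stack);
    while ($stream =~ /\G\s*( \/[^\s\/()<>\[\]]+    # name, e.g. \/F13
                            | [-+]?\d*\.?\d+        # number
                            | \([^()]*\)            # simple literal string
                            | [A-Za-z'"*]+          # operator
                            )/gcx) {
        my $tok = $1;
        if ($tok =~ /^[A-Za-z'"*]/) {               # operator ends a group
            push @ops, [@stack, $tok];
            @stack = ();
        } else {                                    # operand: accumulate
            push @stack, $tok;
        }
    }
    return \@ops;
}
```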
    Hope this is helpful to you.
      Thanks for the suggestion, desemondo!
      I've been leaning very heavily on the Adobe document you referred to (especially the section you referenced) and I had actually seen that example too. Unfortunately it doesn't really help with my problem because my issue is PDF::API2-specific.

      I guess it might be helpful to rephrase my question: how do I use PDF::API2 to "import" the fonts that are embedded in a PDF so that I can map the character IDs used in the content stream to unicode code points (which can then be assembled into words, sentences, etc.)?

      To put the question in perspective, here is a simple example of what I'm dealing with. In the example you supplied, the characters "ABC" are in a known encoding. They would be encoded in the PDF exactly as the example states. However I am dealing with PDFs that do something like:
      BT
      /F13 12 Tf
      288 720 Td
      [<01>4<02>-1<03>-1<04>-1<05>2<06>] TJ
      ET
      I need a way to use PDF::API2 to map the character IDs (the things between the angle brackets) to unicode code points...
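      To make the goal concrete, here is roughly what that decoding step should do once a code-to-character map has been recovered from the font's /ToUnicode CMap. The %cid2uni mapping below is hand-made for illustration, and the sketch assumes one-byte character codes as in the snippet above (CID-keyed fonts often use two-byte codes):

```perl
use strict;
use warnings;

# Decode the hex strings inside a TJ array using a code => character
# map.  The bare numbers between the strings are kerning adjustments
# and are simply skipped here.
sub decode_tj {
    my ($tj, $map) = @_;
    my $text = '';
    while ($tj =~ /<([0-9A-Fa-f]+)>/g) {
        for my $code (map { hex } unpack '(A2)*', $1) {   # one-byte codes
            $text .= $map->{$code} // "\x{FFFD}";  # U+FFFD if unmapped
        }
    }
    return $text;
}

# Hypothetical mapping -- in practice it comes from the parsed CMap.
my %cid2uni = (0x01 => 'H', 0x02 => 'e', 0x03 => 'l',
               0x04 => 'l', 0x05 => 'o', 0x06 => '!');
print decode_tj('[<01>4<02>-1<03>-1<04>-1<05>2<06>] TJ', \%cid2uni), "\n";
# prints "Hello!"
```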

        As a side note, there is a PDF::API3 available on CPAN.

I have used CAM::PDF mainly for the task of extracting text. However, I have had little luck with embedded HTML in PDFs. You may be able to walk the root dictionary of the PDF using CAM::PDF and store the information you need. There is also a module, CAM::PDF::Renderer::Text, that may be of some help.
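For completeness, the basic CAM::PDF extraction calls are sketched below. getPageText() does its own best-effort decoding and is subject to the same missing-/ToUnicode caveat discussed earlier in the thread; the module is loaded at run time so the sketch compiles even where it isn't installed:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Dump the text of every page of a PDF.  For subsetted fonts without
# /ToUnicode tables, getPageText() can return gibberish.
sub extract_all_text {
    my ($file) = @_;
    require CAM::PDF;
    my $doc = CAM::PDF->new($file) or die "failed to open $file\n";
    my @pages;
    for my $p (1 .. $doc->numPages()) {
        push @pages, $doc->getPageText($p);
        # The raw, decompressed content stream is also available:
        #   my $content = $doc->getPageContent($p);
    }
    return join "\f", @pages;    # form feed between pages
}

print extract_all_text($ARGV[0]) if @ARGV;
```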