in reply to Extracting text from a PDF (using PDF::API2)

I dunno if you've looked at it yet, but if you haven't, take a gander at page 291 onwards of the PDF spec v1.7 on Adobe's website.

Also, page 238, 'Basics of Showing Text', had this to say...

9.2.2 Basics of Showing Text

EXAMPLE 1 This example illustrates the most straightforward use of a font. The text ABC is placed 10 inches from the bottom of the page and 4 inches from the left edge, using 12-point Helvetica.

    BT
    /F13 12 Tf
    288 720 Td
    (ABC) Tj
    ET

The five lines of this example perform these steps:
a) Begin a text object.
b) Set the font and font size to use, installing them as parameters in the text state. In this case, the font resource identified by the name F13 specifies the font externally known as Helvetica.
c) Specify a starting position on the page, setting parameters in the text object.
d) Paint the glyphs for a string of characters at that position.
e) End the text object.

Hope this is helpful to you.
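
To make the "10 inches from the bottom, 4 inches from the left" part concrete: the operands of Td are in user space units, which default to 1/72 inch. Here is a rough Python sketch of that arithmetic. It assumes default user space (no cm transform) and a Td that is the first positioning operator in the text object; the function name td_to_inches is made up for illustration.

```python
POINTS_PER_INCH = 72  # default PDF user space unit is 1/72 inch

def td_to_inches(content):
    """Return (x_in, y_in) for the first 'x y Td' operator in a content stream."""
    tokens = content.split()
    i = tokens.index("Td")  # the two operands precede the operator
    x, y = float(tokens[i - 2]), float(tokens[i - 1])
    return x / POINTS_PER_INCH, y / POINTS_PER_INCH

stream = "BT /F13 12 Tf 288 720 Td (ABC) Tj ET"
print(td_to_inches(stream))  # (4.0, 10.0)
```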

Re^2: Extracting text from a PDF (using PDF::API2)
by music_man1352000 (Novice) on Dec 03, 2009 at 03:37 UTC
    Thanks for the suggestion desemondo!
    I've been leaning very heavily on the Adobe document you referred to (especially the section you referenced) and I had actually seen that example too. Unfortunately it doesn't really help with my problem because my issue is PDF::API2-specific.

    I guess it might be helpful to rephrase my question: how do I use PDF::API2 to "import" the fonts that are embedded in a PDF so that I can map the character IDs used in the content stream to Unicode code points (which can then be assembled into words, sentences, etc.)?
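
    For reference, the mechanism the PDF spec provides for this mapping is the font's /ToUnicode CMap stream, whose beginbfchar/endbfchar sections pair a character code with a UTF-16BE value. The following is only a rough Python sketch of reading such a section once the stream text has been extracted by whatever means. It is not PDF::API2 code, the sample CMap is invented, and beginbfrange sections are ignored.

```python
import re

def parse_bfchar(cmap_text):
    """Map character codes to Unicode strings from beginbfchar sections."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # The destination is UTF-16BE and may hold several code units.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

# Invented sample; a real ToUnicode CMap carries more boilerplate around this.
sample_cmap = """
2 beginbfchar
<01> <0048>
<02> <0069>
endbfchar
"""
to_unicode = parse_bfchar(sample_cmap)
print(to_unicode)  # {1: 'H', 2: 'i'}
```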

    To put the question in perspective, here is a simple example of what I'm dealing with. In the example you supplied, the characters "ABC" are in a known encoding. They would be encoded in the PDF exactly as the example states. However I am dealing with PDFs that do something like:
    BT /F13 12 Tf 288 720 Td [<01>4<02>-1<03>-1<04>-1<05>2<06>] TJ ET
    I need a way to use PDF::API2 to map the character IDs (the things between the angle brackets) to Unicode code points...
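
    To illustrate the stream-level half of the problem: in a TJ array the angle-bracket items are hex strings holding character codes, and the bare numbers between them are position adjustments. A minimal Python sketch of pulling out the codes and looking them up follows. The lookup table here is hypothetical (in a real file it would have to come from the font, e.g. its /ToUnicode CMap), one-byte codes are assumed, and literal (parenthesised) strings are not handled.

```python
import re

def decode_tj(array_text, to_unicode, code_bytes=1):
    """Decode the hex strings of a TJ array using a code -> text mapping."""
    out = []
    for hex_str in re.findall(r"<([0-9A-Fa-f\s]*)>", array_text):
        digits = re.sub(r"\s+", "", hex_str)
        if len(digits) % 2:
            digits += "0"  # PDF pads an odd-length hex string with a trailing 0
        raw = bytes.fromhex(digits)
        for i in range(0, len(raw), code_bytes):
            code = int.from_bytes(raw[i:i + code_bytes], "big")
            out.append(to_unicode.get(code, "\ufffd"))  # U+FFFD for unmapped codes
    return "".join(out)

tj = "[<01>4<02>-1<03>-1<04>-1<05>2<06>]"
# Hypothetical mapping, for illustration only.
hypothetical_map = {1: "S", 2: "a", 3: "m", 4: "p", 5: "l", 6: "e"}
print(decode_tj(tj, hypothetical_map))  # Sample
```

    The numbers between the strings are ignored here; a layout-aware extractor would use large negative values as a hint for word spacing.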

      As a side note, there is a PDF::API3 available on CPAN.

      I have mainly used CAM::PDF for extracting text. However, I have had little luck with embedded HTML in PDFs. You may be able to walk the root dictionary of the PDF using CAM::PDF and store the information you need. There is also a module, CAM::PDF::Renderer::Text, that may be of some help.

        Thanks for the suggestions tmaly.
        As I mentioned in the OP, I did look at CAM::PDF. The module you mentioned is kinda useful for rough output. However, it doesn't handle layout at all (it is almost useless for my tabular data) and it can't handle the text encoding that I mentioned in my reply to desemondo. If I used it I'd still have to work at the stream level, but since it doesn't have any support for fonts I thought that PDF::API2 (which appears to support fonts) might be better. In other words, I was hoping that by using PDF::API2 I could avoid traversing the root dictionary (which, if done properly, would be a potentially huge job: too many object references!).

        Just to clarify: the original documents for these PDFs are most likely to be Excel spreadsheets. They don't contain embedded HTML. "<\d+>" is just the way character \d+ (in the example, from font "F13") is encoded.