As a side note, there is also a PDF::API3 available on CPAN.

I have mainly used CAM::PDF for extracting text. However, I have had little luck with embedded HTML in PDFs. You may be able to walk the root dictionary of the PDF using CAM::PDF and store the information you need. There is also a module, CAM::PDF::Renderer::Text, that may be of some help.
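
For reference, a minimal sketch of plain text extraction with CAM::PDF, assuming the module is installed. The file name `input.pdf` is a placeholder, and the output is unlikely to preserve any table layout:

```perl
use strict;
use warnings;
use CAM::PDF;

# 'input.pdf' is a placeholder; pass a real path as the first argument.
my $file = @ARGV ? $ARGV[0] : 'input.pdf';

my $pdf = CAM::PDF->new($file)
    or die "Cannot open $file: $CAM::PDF::errstr\n";

# Pages are numbered from 1.
for my $page (1 .. $pdf->numPages()) {
    my $text = $pdf->getPageText($page);
    next unless defined $text;    # skip pages with no extractable text
    print "--- page $page ---\n$text\n";
}
```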

Re^4: Extracting text from a PDF (using PDF::API2)
by music_man1352000 (Novice) on Dec 03, 2009 at 04:17 UTC
    Thanks for the suggestions, tmaly.
    As I mentioned in the OP, I did look at CAM::PDF. The module you mentioned is kinda useful for rough output. However, it doesn't handle layout at all (it is almost useless for my tabular data), and it can't handle the text encoding that I mentioned in my reply to desemondo. If I used it I'd still have to work at the stream level, and since it doesn't have any support for fonts, I thought that PDF::API2 (which appears to support fonts) might be better. In other words, I was hoping that PDF::API2 would let me avoid traversing the root dictionary, which, if done properly, could be a huge job (too many object references!).

    Just to clarify: the original documents for these PDFs are most likely Excel spreadsheets. They don't contain embedded HTML. "<\d+>" is just the way character code \d+ (in the example, from font "F13") is encoded.
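
To illustrate the encoding described above, here is a rough sketch of decoding "<...>" strings from a content-stream fragment. It is not taken from any module: the sub `decode_hex_strings`, the table `%f13_map`, and the sample stream are all made up for illustration. In a real PDF the code-to-character table would have to be built from the font's own encoding data, and the number of digits per code depends on the font.

```perl
use strict;
use warnings;

# HYPOTHETICAL code-to-character table for font F13. In a real PDF this
# would have to be built from the font's /ToUnicode CMap or /Encoding.
my %f13_map = (0x01 => 'S', 0x02 => 'u', 0x03 => 'm');

# Decode every <hex> string found in a content-stream fragment.
# $width is the number of hex digits per character code
# (2 for one-byte codes, 4 for two-byte codes).
sub decode_hex_strings {
    my ($stream, $map, $width) = @_;
    my @decoded;
    while ($stream =~ /<([0-9A-Fa-f\s]+)>/g) {
        (my $hex = $1) =~ s/\s+//g;         # whitespace inside <...> is ignored
        $hex .= '0' if length($hex) % 2;    # an odd final digit is padded with 0
        my @codes = map { hex($_) } unpack "(A$width)*", $hex;
        # Codes missing from the table are shown as '?'.
        push @decoded, join '', map { $map->{$_} // '?' } @codes;
    }
    return @decoded;
}

# Made-up sample: select font F13, then show one hex-encoded string.
my @strings = decode_hex_strings('BT /F13 10 Tf <010203> Tj ET', \%f13_map, 2);
print "$_\n" for @strings;    # prints "Sum"
```

This only covers the string decoding. It does not track which font is active when a string is shown, so a stream that switches fonts would need one table per font.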