in reply to PDF::API2 traversing object tree and parsing text

Where did you read that PDF::API2 is able to read PDF documents and extract content from them? I am not saying it is impossible, just that I can't see any suggestion that it is possible in the docs on CPAN or in PDF::API2::HOWTO.

Another approach you could take, especially if you just want the text from the PDF, would be to convert it to another format and parse that format with Perl. For example, a Google search for pdf2svg returns an open source command line tool for the purpose, and also Wikipedia instructions on how to convert manually using Inkscape. As SVG is an XML-based format, you should be able to find plenty of Perl libraries and tutorials that will help you extract what you need.
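
A minimal sketch of the parsing half of that approach, in case it helps. It assumes XML::LibXML is installed from CPAN and that the converter writes real <text> elements. Some converters draw text as glyph outlines instead, in which case there is nothing to extract this way. The helper name svg_text_strings and the inline sample SVG are made up for illustration. In practice you would load the converter's output file instead.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Hypothetical helper: return the text content of every <text>
# element found in an SVG document passed in as a string.
sub svg_text_strings {
    my ($svg) = @_;
    my $doc = XML::LibXML->load_xml(string => $svg);

    # SVG elements live in a namespace, so register a prefix for XPath.
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs(svg => 'http://www.w3.org/2000/svg');

    return map { $_->textContent } $xpc->findnodes('//svg:text');
}

# Inline sample standing in for converter output (an assumption,
# not real pdf2svg or Inkscape output).
my $sample = <<'SVG';
<svg xmlns="http://www.w3.org/2000/svg" width="200" height="100">
  <text x="10" y="20">Hello</text>
  <text x="10" y="40"><tspan>PDF</tspan> <tspan>world</tspan></text>
</svg>
SVG

print "$_\n" for svg_text_strings($sample);
```

Note that textContent flattens any nested <tspan> elements, so the positions of individual text runs are lost. If layout matters, you would need to read the x and y attributes as well.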

Re^2: PDF::API2 traversing object tree and parsing text
by Anonymous Monk on Sep 02, 2011 at 09:38 UTC

    PDF::API2 - Facilitates the creation and modification of PDF files

    $pdf = PDF::API2->open($pdffile)

      Facilitates the creation and modification of PDF files

      I saw that in the PDF::API2 docs as well. However, modification does not imply reading. It looks to me as if modification is limited to adding elements to an existing document, such as extra pages with new content, or overprinting existing pages with extra text or pictures.

      To give an analogy, this is like using a printing press to modify an existing printed document. You can print something else on the back, attach extra pages, or even overprint on the front, obliterating anything already there. But the press does not read the document and edit it intelligently; it just adds to it.

      There are method calls in PDF::API2 to read metadata, such as $pdf->preferences(%options), $pdf->default($parameter) and $pdf->info(%infohash), but I think the OP wants more than just metadata.

      As I say, I would be happy to be corrected, but as yet I have seen no evidence that PDF::API2 is able to read and process the contents of a PDF document.

Re^2: PDF::API2 traversing object tree and parsing text
by Lotus1 (Vicar) on Sep 02, 2011 at 13:14 UTC