PDF::API2 traversing object tree and parsing text

RandomMonkey has asked for the wisdom of the Perl Monks concerning the following question:

Does anybody have (or could anybody write) a simple PDF::API2 example that traverses all the objects of an opened PDF file? I have spent some time reading through the docs and googling for examples, and so far I have either found nothing relevant or have not hit on the right Google/PerlMonks search string.

More specifically, I would like to open a PDF file, traverse its objects (at least the text objects), and extract the information from each one. Once I have identified these internal objects, I would perhaps also like to read other meta information from them. PDF::API2 appears to have this capability, but so far I have not been able to figure out how to use it.

I have already scoured Google and PerlMonks for information. I have found many really good threads discussing creating PDF documents, and even some threads that appear to answer my question, but all of the cited examples are no longer reachable.

I am still a bit hazy about how to even identify the display objects in the $pdf object. Ideally, I would like an example that opens a PDF with PDF::API2, shows how to identify and navigate through the object list/tree(?), and shows how to extract text (and/or other information) from these internal objects.

Thanks in advance for any clues or help. :-D

Re: PDF::API2 traversing object tree and parsing text by chrestomanci (Priest) on Sep 02, 2011 at 09:09 UTC
Where did you read that PDF::API2 is able to read PDF documents and extract content from them? I am not saying it is impossible, just that I can't see any suggestion that it is in the docs on CPAN or in PDF::API2::HOWTO. Another approach you could take, especially if you just want the text from the PDF, would be to convert it to another format and parse that format with Perl. For example, a Google search for pdf2svg returns an open source command line tool for the purpose, and also Wikipedia instructions on how to convert manually using Inkscape. As SVG is an XML-based format, you should be able to find plenty of Perl libraries and tutorials that will help you extract what you need.
Re^2: PDF::API2 traversing object tree and parsing text by Anonymous Monk on Sep 02, 2011 at 09:38 UTC
PDF::API2 - Facilitates the creation and modification of PDF files: `$pdf = PDF::API2->open($pdffile)`
Re^3: PDF::API2 traversing object tree and parsing text by chrestomanci (Priest) on Sep 02, 2011 at 12:01 UTC
"Facilitates the creation and modification of PDF files": I saw that in the PDF::API2 docs as well, however modification does not imply reading. It looks to me as if modification is limited to adding elements to an existing document, such as extra pages with new content, or overprinting existing pages with extra text or pictures. To give an analogy, this is like using a printing press to modify an existing printed document. You can print something else on the back, attach extra pages, or even overprint on the front, obliterating anything already there, but the press does not read the document and edit intelligently; it just adds to it. There are method calls in PDF::API2 to read metadata, such as `$pdf->preferences(%options)`, `$pdf->default($parameter)` and `$pdf->info(%infohash)`, but I think the OP wants more than just metadata. As I say, I would be happy to be corrected, but as yet I have seen no evidence that PDF::API2 is able to read and process the contents of a PDF document.
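For reference, reading just the metadata mentioned above might look roughly like the sketch below. It assumes PDF::API2 is installed and that `info()` called without arguments returns the document's info dictionary as a hash (newer releases may prefer a differently named accessor). The helper names `read_pdf_info` and `format_info` are made up for this example and are not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: open an existing PDF and return its info
# dictionary (Title, Author, ...) as a hash reference.
sub read_pdf_info {
    my ($path) = @_;
    require PDF::API2;                   # loaded lazily, only when called
    my $pdf  = PDF::API2->open($path);   # dies if the file cannot be read
    my %info = $pdf->info();             # assumption: no arguments means "get"
    return \%info;
}

# Hypothetical helper: turn the info hash into sorted "Key: value" lines.
sub format_info {
    my ($info) = @_;
    return map { "$_: " . (defined $info->{$_} ? $info->{$_} : '') }
           sort keys %$info;
}

# Only runs when invoked as a script with a file name argument.
if (!caller && @ARGV) {
    print "$_\n" for format_info(read_pdf_info($ARGV[0]));
}
```

This only touches document-level metadata; it does not walk page content or extract text.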
Re^4: PDF::API2 traversing object tree and parsing text by Anonymous Monk on Sep 02, 2011 at 13:27 UTC
Re^2: PDF::API2 traversing object tree and parsing text by Lotus1 (Vicar) on Sep 02, 2011 at 13:14 UTC
I haven't tried it, but this looks promising: Edit PDF Text. This looks more promising: CAM::PDF.
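A minimal, untested-against-your-files sketch of the CAM::PDF route, assuming the module is installed and that its `new`, `numPages` and `getPageText` methods behave as documented. The helper name `extract_page_texts` is invented for this example, and how well the text comes out will depend on how the PDF was produced.

```perl
use strict;
use warnings;

# Hypothetical helper: return one string of extracted text per page.
sub extract_page_texts {
    my ($path) = @_;
    require CAM::PDF;                    # loaded lazily, only when called
    my $pdf = CAM::PDF->new($path)
        or die "Cannot open $path as a PDF\n";

    my @texts;
    for my $n (1 .. $pdf->numPages()) {  # pages are numbered from 1
        my $text = $pdf->getPageText($n);
        push @texts, defined $text ? $text : '';
    }
    return @texts;
}

# Only runs when invoked as a script with a file name argument.
if (!caller && @ARGV) {
    my $page = 0;
    for my $text (extract_page_texts($ARGV[0])) {
        $page++;
        print "--- page $page ---\n$text\n";
    }
}
```

Run it as `perl script.pl some.pdf`, where the script and file names are placeholders.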
Re^3: PDF::API2 traversing object tree and parsing text by Anonymous Monk on Oct 12, 2011 at 19:23 UTC
Unfortunately, CAM::PDF only handles PDF files up to version 1.5 of the PDF spec. Most of the files I deal with nowadays are 1.6 or 1.7. Note that the 1.6 spec was created in 2005, and the author does not intend to add support for the current spec.