Reading PDF (taboo?)

Mondongo has asked for the wisdom of the Perl Monks concerning the following question:

Hello there monks.

I've been googling, cpanning, perlmonking, lookin' all around for information. I have a PDF file and it would be very nice to extract information from it, make some HTMLs out of it, the works.

But I haven't been able to find squat. There's PDF-111, but I don't understand PDF::Core, and PDF::Parse doesn't seem to do much beyond giving me the number of pages in the PDF and things like that.

I'm thinking Adobe, Dmitry Sklyarov, is that why no one wants to mess with PDF? I found a lot of tools to convert TO this format, but almost none to convert FROM.

Any ideas? Should I commit myself to the police, under the DMCA? :-)

Mondongo


Replies are listed 'Best First'.
Re: Reading PDF (taboo?) by dragonchild (Archbishop) on May 19, 2004 at 20:31 UTC
You're looking for PDF::API2. The documentation is lacking, but the source is relatively easy to read.
------
We are the carpenters and bricklayers of the Information Age.
Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose
I shouldn't have to say this, but any code, unless otherwise stated, is untested
Re: Re: Reading PDF (taboo?) by Anonymous Monk on May 20, 2004 at 11:47 UTC
Rather PDF::API2
Re: Re: Reading PDF (taboo?) by Mondongo (Beadle) on May 19, 2004 at 20:34 UTC
Thank you very much. Mondongo
Re: Reading PDF (taboo?) by drake50 (Pilgrim) on May 19, 2004 at 21:16 UTC
You could use something like pdftohtml to convert to HTML and then parse...
Re: Re: Reading PDF (taboo?) by Mondongo (Beadle) on May 19, 2004 at 21:38 UTC
Indeed. But the PDF I'm working with has two (or more) columns of text, it's sort of a tabloid, and pdftohtml makes a big mess out of it. I've been reading RTFs and G*d knows it's an awful format, so I didn't expect PDF, which I thought was an open format, to be so... secretive! Now I've downloaded PDF::API2, but the documentation's sort of cryptic. I'll have to hack my way through! :) Thanks for answering. Mondongo
Re: Re: Re: Reading PDF (taboo?) by drewbie (Chaplain) on May 20, 2004 at 05:16 UTC
At $dayjob we mostly use htmldoc to create our PDFs: we build the HTML and then run it through htmldoc to get the PDF. It's not perfect, and you don't get as much control as you do with PDF::API2. But one advantage of this method is that it gives us a web-accessible version for free! And if you have an existing PDF you want to add pages to, look at importpage() in PDF::API2. In our case, we have an existing report in PDF format that we want to add to a dynamic PDF document. It generally works great (we're using an older release, since the latest requires Perl 5.8+), although I occasionally find it goes into deep recursion on some files when importing the pages. Haven't figured out why, but my workaround is just to do more in htmldoc. :-)
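For anyone following along, here is a rough, untested sketch of the importpage() approach described above. The sub name append_pdf and the file names are made up for illustration, and it assumes PDF::API2 is installed from CPAN. It only shows the general shape (open both documents, import each page, save), not how the poster's own code works.

```perl
use strict;
use warnings;

# Append every page of an existing PDF to the end of another PDF.
# append_pdf and all file names are hypothetical.
sub append_pdf {
    my ($target_file, $source_file, $output_file) = @_;

    # Loaded lazily, so the module is only needed when the sub is called.
    require PDF::API2;

    my $target = PDF::API2->open($target_file);
    my $source = PDF::API2->open($source_file);

    # With no target position given, importpage() adds the imported
    # page at the end of the target document.
    for my $n (1 .. $source->pages) {
        $target->importpage($source, $n);
    }

    $target->saveas($output_file);
    return $target->pages;    # page count after the import
}

# Hypothetical usage:
# append_pdf('dynamic.pdf', 'report.pdf', 'combined.pdf');
```

Newer PDF::API2 releases also offer import_page(), page_count() and save(), so check the documentation of whichever release you have installed.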
Re: Re: Re: Re: Reading PDF (taboo?) by dragonchild (Archbishop) on May 20, 2004 at 11:54 UTC
Re: Reading PDF (taboo?) by arunmep (Beadle) on Oct 18, 2006 at 11:19 UTC
There is a way to do this, and it works if your aim is to search for a particular string. Download pdf2txt.exe (I got it from the Google Desktop directory). Run it with the backtick operator, e.g. qx/pdf2txt a.pdf/, and the PDF will be converted to text. Then open the text file, read the contents, and search them. This is how I did it, and it worked.
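A minimal sketch of that convert-then-search idea, with some assumptions spelled out. The thread doesn't give pdf2txt's exact command line or the name of the text file it writes, so this version takes the converter and both file names as parameters. It assumes the converter accepts "input.pdf output.txt", as pdftotext from Xpdf/Poppler does. It also uses system() in list form rather than backticks, since the text lands in a file and no output needs capturing. The sub names are made up.

```perl
use strict;
use warnings;

# Return [line_number, line] pairs for every line of a text file
# that contains $needle as a literal (non-regex) string.
sub search_text_file {
    my ($path, $needle) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my @hits;
    while (my $line = <$fh>) {
        chomp $line;
        push @hits, [$., $line] if index($line, $needle) >= 0;
    }
    close $fh;
    return @hits;
}

# Run an external converter, then search the text file it wrote.
# ASSUMPTION: the converter is invoked as "converter input.pdf output.txt".
sub search_pdf {
    my ($converter, $pdf, $txt, $needle) = @_;
    system($converter, $pdf, $txt) == 0
        or die "$converter failed (exit status $?)";
    return search_text_file($txt, $needle);
}

# Hypothetical usage:
# printf "%d: %s\n", @$_ for search_pdf('pdftotext', 'a.pdf', 'a.txt', 'taboo');
```

Keeping the search separate from the conversion means the search can be tried on any plain text file before a converter is involved.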