Mondongo has asked for the wisdom of the Perl Monks concerning the following question:

Hello there monks.

I've been googling, cpanning, perlmonking, lookin' all around for information. I have a PDF file and it would be very nice to extract information from it, make some HTML pages out of it, the works.

But I haven't been able to find squat. There's PDF-111, but I don't understand PDF::Core, and PDF::Parse doesn't seem to do much beyond giving me the number of pages in the PDF and things like that.

I'm thinking Adobe, Dmitry Sklyarov: is that why no one wants to mess with PDF? I found a lot of tools to convert TO this format, but almost none to convert FROM it.

Any ideas? Should I commit myself to the police, under the DMCA? :-)


Mondongo

Re: Reading PDF (taboo?)
by dragonchild (Archbishop) on May 19, 2004 at 20:31 UTC
    You're looking for PDF::API2. The documentation is lacking, but the source is relatively easy to read.

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

    I shouldn't have to say this, but any code, unless otherwise stated, is untested

      Thank you very much.


      Mondongo

Re: Reading PDF (taboo?)
by drake50 (Pilgrim) on May 19, 2004 at 21:16 UTC
    You could use something like pdftohtml to convert to html and then parse...

      Indeed

      But the PDF I'm working with has two (or more) columns of text, it's sort of a tabloid, and pdftohtml makes a big mess out of it.
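One possible way around the column mess, offered as a rough sketch rather than a tested recipe: pdftohtml can also emit XML, and this assumes that output contains <text> elements carrying top and left attributes. The function name columns_for_page and the boundary value are made up for illustration; you would have to measure where the columns start in your own file.

```perl
use strict;
use warnings;

# Split the <text> fragments of one page into columns by their "left" value.
#   $page_xml   : XML for one page (assumed pdftohtml -xml style)
#   $boundaries : array ref of x positions where a new column begins
#                 (hypothetical values, measured by hand for your document)
# Returns one string per column, fragments ordered top to bottom.
sub columns_for_page {
    my ($page_xml, $boundaries) = @_;
    my @cuts    = sort { $a <=> $b } @$boundaries;
    my @columns = map { [] } 0 .. scalar @cuts;    # one more column than cuts

    while ($page_xml =~ m{<text\b([^>]*)>(.*?)</text>}gs) {
        my ($attrs, $body) = ($1, $2);
        my ($top)  = $attrs =~ /\btop="(-?\d+)"/  or next;
        my ($left) = $attrs =~ /\bleft="(-?\d+)"/ or next;
        $body =~ s/<[^>]+>//g;                     # drop inline tags such as <b>

        # The column index is the number of cuts at or to the left of the fragment.
        my $index = grep { $left >= $_ } @cuts;
        push @{ $columns[$index] }, { top => $top, left => $left, text => $body };
    }

    return map {
        join "\n",
            map  { $_->{text} }
            sort { $a->{top} <=> $b->{top} || $a->{left} <=> $b->{left} } @$_
    } @columns;
}
```

For a two-column page whose right column starts around x = 400, something like my ($left, $right) = columns_for_page($xml, [400]); would give the two columns as separate strings. Entities such as &amp;amp; are left undecoded here.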

      I've been reading RTF files, and G*d knows that's an awful format, so I didn't expect PDF, which I thought was an open format, to be so... secretive!

      Now I've downloaded PDF::API2, but the documentation's sort of cryptic. I'll have to hack my way through! :)


      Thanks for answering.


      Mondongo

        At $dayjob we mostly use htmldoc to create our PDFs. So we build the HTML and then run it through htmldoc to get the PDF. It's not perfect, and you don't get all the control like you can with PDF::API2. But one advantage of this method is that it gives us a web-accessible version for free!

        And if you have an existing PDF you want to add pages to, look at importpage(). In our case, we have an existing report in PDF format that we want to add to a dynamic PDF document. It generally works great (we're using an older release, since the latest requires Perl 5.8+), although I occasionally find it goes into deep recursion on some files when importing the pages. I haven't figured out why, but my workaround is just to do more in htmldoc. :-)
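For the importpage() approach, a minimal sketch might look like the following. It assumes PDF::API2 is installed from CPAN and that open(), pages(), importpage() and saveas() behave as in the releases I have seen (newer releases may prefer other method names). The function name append_pdf and the file names are placeholders, and like most code here it is untested against your files.

```perl
use strict;
use warnings;

# Append every page of $source_file to $target_file and write $output_file.
# All three file names are supplied by the caller; none come from the thread.
sub append_pdf {
    my ($target_file, $source_file, $output_file) = @_;
    require PDF::API2;    # loaded lazily so the file compiles without it

    my $target = PDF::API2->open($target_file);
    my $source = PDF::API2->open($source_file);

    # Source pages are numbered from 1; a target position of 0 is
    # assumed to mean "add the imported page at the end".
    for my $page_number (1 .. $source->pages) {
        $target->importpage($source, $page_number, 0);
    }

    $target->saveas($output_file);
    return $output_file;
}
```

A call such as append_pdf('dynamic.pdf', 'report.pdf', 'combined.pdf') would then produce the combined document. If the deep recursion shows up, importing one page at a time like this at least tells you which page triggers it.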

Re: Reading PDF (taboo?)
by arunmep (Beadle) on Oct 18, 2006 at 11:19 UTC
    There is a way to do this if your aim is to search for a particular string. Download pdf2txt.exe (I got it from the Google Desktop directory). Run it with the backtick operator, e.g. qx/pdf2txt a.pdf/, and the PDF will be converted to text. Then open the text file, read the contents, and search them. This is how I did it, and it worked.
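The convert-then-search steps could be sketched roughly as below. The thread only gives the command as pdf2txt a.pdf, so the converter's arguments and the name of the text file it writes are assumptions; run_converter and search_text_file are hypothetical helper names. system() is used instead of backticks here because the output is read from a file, not from the command's standard output.

```perl
use strict;
use warnings;

# Run the external converter. The command and its arguments are placeholders
# passed in by the caller, e.g. ('pdf2txt', 'a.pdf').
sub run_converter {
    my (@command) = @_;
    system(@command) == 0
        or die "Command '@command' failed: $?";
    return;
}

# Return [line number, line] pairs for each line of $text_file
# that contains the literal string $needle.
sub search_text_file {
    my ($text_file, $needle) = @_;
    open my $in, '<', $text_file
        or die "Cannot open $text_file: $!";
    my @hits;
    while (my $line = <$in>) {
        chomp $line;
        push @hits, [ $., $line ] if index($line, $needle) >= 0;
    }
    close $in;
    return @hits;
}

# Example use (the output name a.txt is an assumption):
#   run_converter('pdf2txt', 'a.pdf');
#   printf "%d: %s\n", @$_ for search_text_file('a.txt', 'some string');
```

index() is used rather than a regex so that characters like . or ( in the search string are matched literally.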