in reply to Re: PDF download
in thread PDF download

I did check CPAN, but it only has modules to create or manipulate PDFs, not to simply grab the content off the web. To be precise, it's the content I'm bothered with: I need the text each time, as I am working on information retrieval and parallel texts. Cheers!

Re: Re: Re: PDF download
by dragonchild (Archbishop) on Jan 08, 2004 at 03:59 UTC
    Take a look at http://search.cpan.org/~antro/PDF-111/examples/pagedump.pl. It's in the PDF distribution. I've never used it, but it says it can parse "all possible data occuring in a PDF".

    Some other options could be:

    • PDF::Parse (though it doesn't look like it'll get you everywhere you want to go)
    • pdf2text (there are a number of versions). You might have to convert the PDF to text first and then parse the result.
    • The PDF format isn't that hard to parse. I mean, if PDF::API2 can build a PDF without very much convolution (outside of Unicode and fonts), one should be able to parse it relatively easily, I would think ...
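
    For the convert-then-parse route, here is a minimal, untested sketch. It assumes LWP::Simple is installed and that a pdftotext binary (from xpdf or poppler) is on your PATH, which may not be the same tool as whichever pdf2text you find. The sub names and the URL in the usage comment are made up for illustration, and the list-form pipe open assumes a Unix-like system.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Download $url into a temporary file and return the file name.
sub fetch_pdf {
    my ($url) = @_;
    require LWP::Simple;    # loaded lazily; assumed to be installed
    my ($fh, $file) = tempfile(SUFFIX => '.pdf', UNLINK => 1);
    close $fh;
    # getstore() returns the HTTP status code of the response.
    my $status = LWP::Simple::getstore($url, $file);
    die "Download of $url failed: HTTP $status\n"
        unless $status >= 200 && $status < 300;
    return $file;
}

# Run the external pdftotext tool on $file and return its output.
# The trailing '-' asks pdftotext to write to standard output.
sub pdf_to_text {
    my ($file) = @_;
    open my $pipe, '-|', 'pdftotext', $file, '-'
        or die "Cannot run pdftotext: $!\n";
    local $/;               # slurp the whole output at once
    my $text = <$pipe>;
    close $pipe or die "pdftotext failed on $file\n";
    return $text;
}

# pdftotext separates pages with form feeds; split on them,
# trim surrounding whitespace, and drop empty pages.
sub split_pages {
    my ($text) = @_;
    my @pages = split /\f/, $text;
    s/^\s+|\s+$//g for @pages;
    return grep { length } @pages;
}

# Usage (hypothetical URL):
#   my $text  = pdf_to_text(fetch_pdf('http://example.com/paper.pdf'));
#   my @pages = split_pages($text);
```

    Keeping the pages separate may help when aligning parallel texts, since page boundaries give you a coarse anchor before any sentence-level alignment.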

    ------
    We are the carpenters and bricklayers of the Information Age.

    Please remember that I'm crufty and crotchety. All opinions are purely mine and all code is untested, unless otherwise specified.