xan has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I wrote a program that basically spiders links and grabs a copy of each page spidered. This works perfectly for HTML pages (.html, .htm, etc.), but I can't do it for PDF. Any ideas on how I could grab PDFs too? Thanks, xan
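For context, the usual stumbling block here is that a PDF is binary data, but LWP fetches it exactly the same way as HTML; you just have to write the response body out in raw mode. A minimal sketch (the URL is a hypothetical placeholder, and this assumes LWP::UserAgent is installed):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical URL -- substitute a link found by the spider.
my $url = 'http://example.com/paper.pdf';

my $ua  = LWP::UserAgent->new;
my $res = $ua->get($url);
die 'Fetch failed: ', $res->status_line unless $res->is_success;

# Only save the body if the server actually returned a PDF.
if ( $res->content_type eq 'application/pdf' ) {
    open my $fh, '>', 'paper.pdf' or die "open: $!";
    binmode $fh;                 # binary-safe write on all platforms
    print {$fh} $res->content;
    close $fh;
}
```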

Replies are listed 'Best First'.
Re: PDF download
by dragonchild (Archbishop) on Jan 07, 2004 at 20:47 UTC
    Have you checked on CPAN? There's a bunch of modules to help you read PDFs.

    ------
    We are the carpenters and bricklayers of the Information Age.

    Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

      I did check CPAN, but they only have modules to create or manipulate PDFs, not to simply grab the content off the web. To be precise, it's the content I'm bothered with: I need the text each time, as I am working on information retrieval and parallel texts. Cheers!
        Take a look at http://search.cpan.org/~antro/PDF-111/examples/pagedump.pl. It's in the PDF distribution. I've never used it, but it says it can parse "all possible data occurring in a PDF".

        Some other options could be:

        • PDF::Parse (though it doesn't look like it'll get you everywhere you want to go)
        • pdf2text (there are a number of versions). You might have to convert the PDF to plain text before you can parse it.
        • The PDF format isn't that hard to parse. I mean, if PDF::API2 can build a PDF without very much convolution (outside of Unicode and fonts), one should be able to parse it relatively easily, I would think ...
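        Along the pdf2text line above, one concrete route is the pdftotext command-line tool (shipped with xpdf). A sketch, assuming pdftotext is on your PATH and that sample.pdf is a hypothetical file the spider has already downloaded:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $pdf = 'sample.pdf';     # hypothetical, already-downloaded file
my $txt = 'sample.txt';

# Shell out to pdftotext to convert the PDF to plain text.
system( 'pdftotext', $pdf, $txt ) == 0
    or die "pdftotext failed: $?";

# Slurp the extracted text for the information-retrieval step.
open my $fh, '<', $txt or die "open: $!";
my $text = do { local $/; <$fh> };
close $fh;

print length($text), " characters extracted\n";
```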

        ------
        We are the carpenters and bricklayers of the Information Age.

        Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

Re: PDF download
by CountZero (Bishop) on Jan 07, 2004 at 20:27 UTC
    Any chances we could have a look at your program? If it is too long, perhaps put it on your scratchpad?

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: PDF download
by Roger (Parson) on Jan 08, 2004 at 02:54 UTC
    Have a look at merlyn's WWW::Mechanize example on CPAN here. Look under the title 'get-despair, by Randal Schwartz'. Randal's example sucks down all the pictures; you only need a minor modification to suck down HTML and PDFs, using the mirror method.
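    To illustrate the mirror-method idea (this is a sketch, not merlyn's actual code, and the start URL is a hypothetical placeholder): fetch a page, collect every link ending in .pdf, and mirror each one locally.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
use URI;

my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->get('http://example.com/papers/');    # hypothetical start page

for my $link ( $mech->find_all_links( url_regex => qr/\.pdf$/i ) ) {
    my $url = $link->url_abs;
    ( my $file = URI->new($url)->path ) =~ s{.*/}{};   # basename of the path
    next unless length $file;

    # mirror() is inherited from LWP::UserAgent: it saves the response
    # to disk and skips the download if the local copy is up to date.
    $mech->mirror( $url, $file );
    print "mirrored $url -> $file\n";
}
```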

Re: PDF download
by jjhayes84 (Novice) on Jan 07, 2004 at 21:31 UTC
    Do you just want to download the PDFs, or do you want to follow the links in PDF documents as well?
      Yes, I want to follow the links in PDFs as well, but the spider does this already. It is simply the grabbing-the-PDF-page bit, so I also have a hard copy, that is the problem. With HTML it works fine: it scours through links given a start link and then nabs all the pages it gets to. All links already spidered get put into a hash, so it doesn't go back there twice. I will give you guys the code. It is long, so I can probably show you the bit doing the job for HTML. It's 2:33am right now, and I still haven't got further, so bed beckons!!! Thanks guys!
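        The "hash of links already spidered" bookkeeping described above can be sketched like this (the queue contents are hypothetical; the real spider would push the links it finds):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %seen;                               # URLs already visited
my @queue = (                           # hypothetical work queue
    'http://example.com/a.html',
    'http://example.com/b.pdf',
    'http://example.com/a.html',        # duplicate -- should be skipped
);

while ( my $url = shift @queue ) {
    next if $seen{$url}++;              # never visit the same URL twice
    print "fetching $url\n";
    # ... fetch $url here, and push any newly found links onto @queue ...
}
```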