in reply to Designing a web scrapper
1. How do I keep track of the articles that have already been downloaded?
Any way you please ;) Try AnyDBM_File (I like DB_File) -- all you really need is a hash (perldoc perldata) or an array .... (you should really read one of merlyn's articles, see below).
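For instance, here is a minimal sketch of the tied-hash approach with DB_File; the seen.db file name and the article URL are placeholders, not anything from your setup:

```perl
# Track already-downloaded articles in a hash tied to a DB_File database,
# so the "seen" list survives between runs.
use strict;
use warnings;
use DB_File;

# 'seen.db' is just an example file name
tie my %seen, 'DB_File', 'seen.db'
    or die "Cannot tie seen.db: $!";

my $url = 'http://example.com/article/1';   # hypothetical article URL

unless ( $seen{$url} ) {
    # ... fetch and save the article here ...
    $seen{$url} = time;   # remember when we downloaded it
}

untie %seen;
```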
2. Should I use WWW::Mechanize for this or is it better to use the HTML::TableExtract and cycle through the Next link? Which is more efficient?
Depends.
If you can get to what you want by simply manipulating the URL (without any form submissions), then there is no need to involve WWW::Mechanize (LWP::Simple ought to do).
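For that plain-URL case, a minimal LWP::Simple sketch (the URL pattern and page range are invented for illustration):

```perl
# Fetch a run of pages just by editing the query string -- no forms,
# so no need for WWW::Mechanize.
use strict;
use warnings;
use LWP::Simple qw(get);

for my $page ( 1 .. 5 ) {
    my $url  = "http://example.com/archive?page=$page";   # hypothetical
    my $html = get($url);
    unless ( defined $html ) {
        warn "Failed to fetch $url\n";
        next;
    }
    # hand $html off to HTML::LinkExtractor, HTML::TableExtract, etc.
    print "Fetched ", length($html), " bytes from $url\n";
}
```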
If you only want the links, use HTML::LinkExtor. If you also want the associated link text, then use HTML::LinkExtractor (for example, given a link pointing at "http://search.cpan.org/search?mode=module&query=HTML%3A%3ALinkExtractor" whose label reads "HTML::LinkExtractor", it gives you both the URL and the text).
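A small HTML::LinkExtractor sketch, assuming $html already holds a fetched page (the sample markup is made up):

```perl
# Pull out every <a> link together with its text. _TEXT holds the
# complete element, e.g. <a href="...">perl.com</a>.
use strict;
use warnings;
use HTML::LinkExtractor;

my $html = '<p>See <a href="http://perl.com/">perl.com</a> for more.</p>';

my $lx = HTML::LinkExtractor->new;
$lx->parse( \$html );

for my $link ( @{ $lx->links } ) {
    next unless $link->{tag} eq 'a' and defined $link->{href};
    print "$link->{href} => $link->{_TEXT}\n";
}
```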
If you want to extract more information, HTML::TableExtract might be appropriate (although I prefer HTML::TokeParser::Simple ).
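And here's a minimal HTML::TokeParser::Simple sketch that walks the token stream and collects each link's href and text (the sample HTML is a placeholder):

```perl
# For each <a> start tag, grab its href attribute and the text up to
# the matching </a>.
use strict;
use warnings;
use HTML::TokeParser::Simple;

my $html   = '<p><a href="/article/1">First article</a></p>';
my $parser = HTML::TokeParser::Simple->new( string => $html );

while ( my $token = $parser->get_token ) {
    next unless $token->is_start_tag('a');
    my $href = $token->get_attr('href');
    next unless defined $href;

    my $text = '';
    while ( my $inner = $parser->get_token ) {
        last if $inner->is_end_tag('a');
        $text .= $inner->as_is if $inner->is_text;
    }
    print "$href => $text\n";
}
```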
3. I'm thinking of creating a weekly directory so I don't have all the links in one large directory.
That sounds reasonable (whatever floats your boat).
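If it helps, a tiny sketch of naming a per-week directory (the downloads/ base path is invented):

```perl
# Build a directory name from the year and week number, e.g.
# downloads/2004-W07, and create it on first use.
use strict;
use warnings;
use POSIX qw(strftime);
use File::Path qw(mkpath);

my $week_dir = 'downloads/' . strftime( '%Y-W%U', localtime );
mkpath($week_dir) unless -d $week_dir;

print "Saving this week's articles under $week_dir\n";
```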
merlyn has written a few articles on the subject, so you might wanna check them out at http://www.stonehenge.com/merlyn/.