Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
Here's what I know of the source. Hoovers keeps 30 days of articles online and displays 10 articles at a time. I can find the older articles by following a link like this: http://hoovnews.hoovers.com/fp.asp?layout=chak&industry=Hoovers+Earnings+%26+Forecasts&starting=11&symbol=
By changing the starting=nn value, I can cycle through the listings until I hit a page error.
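Roughly the loop I have in mind for walking those listing pages (only a sketch; the safety cap, the stop conditions, and the link-detection regex are guesses, since I don't know exactly what the error page looks like):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;

    # Bump starting=nn by 10 each pass and stop when the request fails
    # or the page no longer appears to contain article links.
    my $base = 'http://hoovnews.hoovers.com/fp.asp?layout=chak'
             . '&industry=Hoovers+Earnings+%26+Forecasts&symbol=';
    my $ua = LWP::UserAgent->new( timeout => 30 );

    my $start = 1;
    while ( $start < 1000 ) {                    # arbitrary safety cap
        my $resp = $ua->get( "$base&starting=$start" );
        last unless $resp->is_success;           # the "page error" case
        my $html = $resp->content;
        last unless $html =~ /fp\.asp\?/;        # guess: no more article links
        # ... hand $html off to the link-extraction / download code ...
        $start += 10;
    }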
Some design decisions I need to make:
1. How do I keep track of the articles that have already been downloaded? Should I have a simple file that records only the last article downloaded, or is it better to keep a formatted index file with the link text and the timestamp of when each article was downloaded? (A rough sketch of the index-file idea follows this list.)
2. Should I use WWW::Mechanize for this, or is it better to use HTML::TableExtract and cycle through the Next link? Which is more efficient? (A WWW::Mechanize sketch also follows below.)
3. I'm thinking of creating a weekly directory so I don't end up with all the downloaded articles in one large directory.
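For question 1, this is roughly what I have in mind for the index file (a sketch only; the file name, the tab-separated layout, and keying on the URL are all my own assumptions):

    use strict;
    use warnings;

    my $index_file = 'seen_articles.idx';    # hypothetical name

    # Load URLs that have already been downloaded into a hash for quick lookup.
    sub load_seen {
        my %seen;
        if ( open my $fh, '<', $index_file ) {
            while ( my $line = <$fh> ) {
                chomp $line;
                my ($url) = split /\t/, $line;
                $seen{$url} = 1;
            }
            close $fh;
        }
        return \%seen;
    }

    # Append one record: URL, link text, and the time it was downloaded.
    sub mark_seen {
        my ( $url, $text ) = @_;
        open my $fh, '>>', $index_file or die "Can't append to $index_file: $!";
        print {$fh} join( "\t", $url, $text, scalar localtime ), "\n";
        close $fh;
    }

Keying on the URL rather than the link text seems safer in case Hoovers reuses headlines, but I'm open to other layouts.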
Any other gotchas or things to keep in mind?
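And for question 2, the WWW::Mechanize route I'm picturing looks something like this (the link-matching regexes are pure guesses at the page's markup; the alternative would be HTML::TableExtract over each page's article table):

    use strict;
    use warnings;
    use WWW::Mechanize;

    my $start_url = 'http://hoovnews.hoovers.com/fp.asp?layout=chak'
                  . '&industry=Hoovers+Earnings+%26+Forecasts&starting=1&symbol=';

    my $mech = WWW::Mechanize->new( autocheck => 0 );
    $mech->get( $start_url );

    while ( $mech->success ) {
        # Collect article links on this listing page (url_regex is a guess).
        for my $link ( $mech->find_all_links( url_regex => qr/fp\.asp/ ) ) {
            print $link->url_abs, "\t", $link->text, "\n";
        }
        # Follow the "Next" link, if there is one.
        my $next = $mech->find_link( text_regex => qr/next/i );
        last unless $next;
        $mech->get( $next->url_abs );
    }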
Replies are listed 'Best First'.
Re: Designing a web scrapper
by PodMaster (Abbot) on May 17, 2003 at 10:22 UTC