Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I'm learning to use Perl modules. As a learning step, I'm interested in writing a scraper to grab articles from Hoovers once a week so I can read them offline over the weekend. I was hoping to get some design tips from more experienced coders on writing screen scrapers.

Here's what I know about the source. Hoovers keeps 30 days of articles online and displays 10 articles at a time. I can find older articles by following a link like: http://hoovnews.hoovers.com/fp.asp?layout=chak&industry=Hoovers+Earnings+%26+Forecasts&starting=11&symbol=

By changing the starting=nn value, I can cycle through the pages until I hit a page error.
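I'm imagining something along these lines to walk the listing pages (an untested sketch; the starting value for the first page, the cap of 300, and the stop condition are just guesses about how the site behaves):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple;

    # Untested sketch: fetch listing pages 10 articles at a time until
    # a page stops coming back (my guess at what the "page error" means).
    # Assumes the first page is starting=1; the cap is just a safety net.
    for ( my $start = 1; $start <= 300; $start += 10 ) {
        my $url = 'http://hoovnews.hoovers.com/fp.asp?layout=chak'
                . '&industry=Hoovers+Earnings+%26+Forecasts'
                . "&starting=$start&symbol=";
        my $page = get($url);
        last unless defined $page;    # no page returned -- assume we've run out
        # ... hand $page off to whatever extracts the article links ...
    }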

Some design decisions I need to make:

1. How do I keep track of the articles that have already been downloaded? Should I have a simple file that keeps track of the last linked article? Or is it better to have a formatted index file that has the link text and a timestamp of when it was downloaded?

2. Should I use WWW::Mechanize for this, or is it better to use HTML::TableExtract and cycle through the Next link? Which is more efficient?

3. I'm thinking of creating a weekly directory so I don't have all the links in one large directory.

Any other gotchas or things to keep in mind?

Re: Designing a web scraper
by PodMaster (Abbot) on May 17, 2003 at 10:22 UTC
    1. How do I keep track of the articles that have already been downloaded?
    Any way you please ;) Try AnyDBM_File (I like DB_File) -- all you really need is a hash (perldoc perldata) or an array... (you should really read one of merlyn's articles, see below).
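    A rough sketch of the DB_File approach (file and variable names are just for illustration):

        use strict;
        use warnings;
        use Fcntl;      # for O_CREAT / O_RDWR
        use DB_File;

        # Tie a hash to an on-disk DB file so the "already downloaded"
        # list survives between weekly runs.
        tie my %seen, 'DB_File', 'seen_articles.db', O_CREAT | O_RDWR, 0644
            or die "Cannot tie seen_articles.db: $!";

        my @article_urls = ();   # filled in by whatever extracts the links

        for my $url (@article_urls) {
            next if $seen{$url};     # already grabbed on an earlier run
            # ... download and save the article here ...
            $seen{$url} = time;      # remember when it was downloaded
        }

        untie %seen;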
    2. Should I use WWW::Mechanize for this, or is it better to use HTML::TableExtract and cycle through the Next link? Which is more efficient?
    Depends.

    If you can get to what you want by simply manipulating the URL (without any form submissions), then there is no need to involve WWW::Mechanize (LWP::Simple ought to do).

    If you only want the links, use HTML::LinkExtor. If you also want the associated link text, then use HTML::LinkExtractor (for example, the link would be "http://search.cpan.org/search?mode=module&query=HTML%3A%3ALinkExtractor" and the link text would be "HTML::LinkExtractor").
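    For instance, a quick sketch with HTML::LinkExtractor (not tested against the Hoovers markup, so take the filtering with a grain of salt):

        use strict;
        use warnings;
        use LWP::Simple;
        use HTML::LinkExtractor;

        # Sketch: print every link and its anchor markup from one listing page.
        my $html = get('http://hoovnews.hoovers.com/fp.asp?layout=chak'
                     . '&industry=Hoovers+Earnings+%26+Forecasts&starting=11&symbol=')
            or die "couldn't fetch the listing page";

        my $lx = HTML::LinkExtractor->new();
        $lx->parse(\$html);

        for my $link (@{ $lx->links }) {
            next unless $link->{tag} eq 'a' and defined $link->{href};
            # _TEXT holds the whole anchor, e.g. <a href="...">Article title</a>
            print "$link->{href}\t$link->{_TEXT}\n";
        }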

    If you want to extract more information, HTML::TableExtract might be appropriate (although I prefer HTML::TokeParser::Simple ).
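    With HTML::TokeParser::Simple, that looks roughly like this (a sketch; 'listing.html' is a hypothetical saved copy of one of the listing pages):

        use strict;
        use warnings;
        use HTML::TokeParser::Simple;

        # Sketch: pull link text and hrefs out of a saved listing page.
        # 'listing.html' is hypothetical -- substitute however you fetch the HTML.
        my $html = do {
            local $/;
            open my $fh, '<', 'listing.html' or die "can't open listing.html: $!";
            <$fh>;
        };

        my $p = HTML::TokeParser::Simple->new(\$html);

        while ( my $token = $p->get_token ) {
            next unless $token->is_start_tag('a');
            my $href = $token->get_attr('href') or next;
            my $text = $p->get_trimmed_text('/a');   # link text up to the closing </a>
            print "$text => $href\n";
        }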

    3. I'm thinking of creating a weekly directory so I don't have all the links in one large directory
    That sounds reasonable (whatever floats your boat).

    merlyn has written a few articles on the subject, so you might wanna check them out at http://www.stonehenge.com/merlyn/.
