in reply to Scrapping web site - printing - save to file
A couple of common tools for scraping would be LWP::UserAgent (or LWP::Simple) and WWW::Mechanize. WWW::Mechanize can automate the process of walking through the links for you, while with LWP you'll have to fetch the first page, parse out the links you want somehow, and then fetch those pages.

Parsing can be done with a wide array of HTML/XML/DOM parsers/structure-builders. (I don't have one to recommend because there are so many popping up all the time that it seems like there's a new favorite every time I do such a task.) Parsing can also be done with regexes, but that's usually not recommended except for quick-and-dirty tasks. (Some will probably say that it's never recommended, but I recently had a task where I needed to parse a single value out of a page, and the overhead of loading the entire page into HTML::TreeBuilder caused a 10-fold increase in time used, compared to a single regex, so you have to decide each case for yourself.)
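Here's a minimal sketch of the WWW::Mechanize approach; the start URL and the link-matching regex are placeholders you'd swap for whatever the real site uses:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get('http://www.example.com/index.html');

    # find_all_links returns WWW::Mechanize::Link objects for links
    # whose text matches the regex -- adjust to taste
    my @links = $mech->find_all_links( text_regex => qr/chapter/i );

    for my $link (@links) {
        $mech->get( $link->url_abs );
        my $html = $mech->content;
        # ... pull out whatever you need from $html here ...
    }

With plain LWP you'd do the same loop yourself: fetch the index page, run it through your parser (or regex) to collect the URLs, then fetch each one in turn.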
In either case, printing a series of pages to a single file is no big deal; just open the file at the beginning, print each one to it, and then close it. However, it's unlikely that you really want to print all of each page to the file, since in most cases that will include a lot of cruft like the HTML <head> section and the page's headers and footers and menus and so on. So again, you're probably going to want a parser/tree-builder of some sort to help you pluck out the section of the page that you actually want to save.
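Putting the two together, something along these lines would do it. This assumes each page keeps its main text in a div with id="content", which is purely a stand-in for whatever marker the real pages use:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;
    use HTML::TreeBuilder;

    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get('http://www.example.com/index.html');

    # open the output file once, append each page's content to it
    open my $out, '>', 'combined.html'
        or die "Can't open combined.html: $!";

    for my $link ( $mech->find_all_links( text_regex => qr/chapter/i ) ) {
        $mech->get( $link->url_abs );

        my $tree = HTML::TreeBuilder->new_from_content( $mech->content );
        # grab just the section you care about, skipping head/menus/footers
        my $section = $tree->look_down( _tag => 'div', id => 'content' );
        print {$out} $section->as_HTML, "\n" if $section;
        $tree->delete;    # free the tree's memory before the next page
    }

    close $out;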
Aaron B.
Available for small or large Perl jobs; see my home node.