Two common tools for scraping are LWP::UserAgent (or LWP::Simple) and WWW::Mechanize. WWW::Mechanize can automate the process of walking through the links for you, while with LWP you'll have to fetch the first page, parse out the links you want, and then fetch those pages yourself.

Parsing can be done with a wide array of HTML/XML/DOM parsers and tree-builders. (I don't have one to recommend, because new ones pop up so often that there seems to be a new favorite every time I do such a task.) Parsing can also be done with regexes, but that's usually not recommended except for quick-and-dirty tasks. (Some will probably say that it's never recommended, but I recently had a task where I needed to parse a single value out of a page, and the overhead of loading the entire page into HTML::TreeBuilder made it take about ten times as long as a single regex, so you have to decide each case for yourself.)
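To make the LWP route concrete, here is a minimal sketch of "fetch the first page, parse out the links, fetch those pages." It uses HTML::LinkExtor as the link parser, which is just one choice among many and not a recommendation. The helper names `extract_links` and `fetch_page` are made up for this example, and the start URL is whatever you pass on the command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI;

# Hypothetical helper: return the absolute URLs of all <a href> links
# in $html. $base is the URL the page came from, used to resolve
# relative links.
sub extract_links {
    my ($html, $base) = @_;
    my @links;
    my $parser = HTML::LinkExtor->new(
        sub {
            my ($tag, %attr) = @_;
            # Skip <img>, <link>, etc.; keep only anchors with an href.
            return unless $tag eq 'a' && defined $attr{href};
            push @links, URI->new_abs($attr{href}, $base)->as_string;
        }
    );
    $parser->parse($html);
    $parser->eof;
    return @links;
}

# Hypothetical helper: fetch one page and return its decoded HTML,
# or undef (with a warning) if the request failed.
sub fetch_page {
    my ($ua, $url) = @_;
    my $response = $ua->get($url);
    unless ($response->is_success) {
        warn "GET $url failed: ", $response->status_line, "\n";
        return;
    }
    return $response->decoded_content;
}

# Only touch the network when a start URL is given on the command line.
if (@ARGV) {
    my $start = shift @ARGV;
    my $ua    = LWP::UserAgent->new(timeout => 30);
    my $first = fetch_page($ua, $start);
    die "Could not fetch start page\n" unless defined $first;

    for my $link (extract_links($first, $start)) {
        my $html = fetch_page($ua, $link);
        next unless defined $html;
        printf "%s: %d characters\n", $link, length $html;
    }
}
```

In practice you'd filter the list from `extract_links` down to the links you actually care about before fetching them. With WWW::Mechanize, `$mech->get($url)` followed by `$mech->links` gives you link objects with a `url_abs` method, which replaces the `extract_links` step.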

In either case, printing a series of pages to a single file is no big deal: just open the file at the beginning, print each page to it, and then close it. However, it's unlikely that you really want to print all of each page to the file, since in most cases that will include a lot of cruft like the HTML <head> section and the page's headers, footers, menus, and so on. So again, you're probably going to want a parser/tree-builder of some sort to help you pluck out the section of the page that you actually want to save.
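A rough sketch of the open/print/close pattern, combined with a tree-builder to keep only the part you want, might look like the following. It assumes the interesting section of each page is an element with `id="content"`; that id, the helper names `extract_section` and `save_sections`, and the demo pages are all invented for illustration, so adjust them to the site you're scraping. HTML::TreeBuilder is used here only because it was mentioned above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return the text of the first element whose id
# attribute is $id, or undef if the page has no such element.
sub extract_section {
    my ($html, $id) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $node = $tree->look_down(id => $id);   # scalar context: first match
    my $text = $node ? $node->as_text : undef;
    $tree->delete;                            # free the parse tree
    return $text;
}

# Hypothetical helper: open the output file once, print the wanted
# section of each page to it, then close it. Returns the number of
# sections written.
sub save_sections {
    my ($outfile, $id, @pages) = @_;
    open my $out, '>:encoding(UTF-8)', $outfile
        or die "Cannot open $outfile: $!\n";
    my $saved = 0;
    for my $html (@pages) {
        my $text = extract_section($html, $id);
        next unless defined $text;            # page had no such section
        print {$out} $text, "\n";
        $saved++;
    }
    close $out or die "Cannot close $outfile: $!\n";
    return $saved;
}

# Demo with made-up pages; runs only when an output filename is given.
if (@ARGV) {
    my @demo_pages = (
        '<html><head><title>One</title></head><body>'
          . '<div id="menu">Menu</div>'
          . '<div id="content"><p>First page body</p></div></body></html>',
        '<html><body><div id="content">Second page body</div></body></html>',
    );
    my $n = save_sections($ARGV[0], 'content', @demo_pages);
    print "Saved $n sections to $ARGV[0]\n";
}
```

Note that `as_text` flattens the element to plain text; if you want to keep the markup of the section, `as_HTML` on the same node is the usual alternative.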

Aaron B.
Available for small or large Perl jobs; see my home node.


In reply to Re: Scrapping web site - printing - save to file by aaron_baugher
in thread Scrapping web site - printing - save to file by locust
