1. How do I keep track of the articles that have already been downloaded?
Any way you please ;) Try AnyDBM_File (I like DB_File); all you really need is a hash (perldoc perldata) or an array. You should really read one of merlyn's articles (see below).
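To make the "all you really need is a hash" point concrete, here is a minimal sketch that keeps the hash on disk through AnyDBM_File. The helper names (open_seen_db, new_urls) and the file name are made up for illustration and are not from any module.

```perl
use strict;
use warnings;
use Fcntl;          # exports O_RDWR and O_CREAT
use AnyDBM_File;    # picks whichever DBM implementation this perl has

# Hypothetical helper: tie a hash to a DBM file so it survives between runs.
sub open_seen_db {
    my ($path) = @_;
    tie my %seen, 'AnyDBM_File', $path, O_RDWR | O_CREAT, 0644
        or die "Cannot open $path: $!";
    return \%seen;
}

# Hypothetical helper: return only the URLs that are not yet in the hash.
sub new_urls {
    my ($seen, @urls) = @_;
    return grep { !exists $seen->{$_} } @urls;
}
```

Usage would look roughly like: my $seen = open_seen_db('seen_articles'); then loop over new_urls($seen, @candidates), download each one, and set $seen->{$url} = time afterwards. A plain in-memory hash works with new_urls just as well if you do not need persistence.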
2. Should I use WWW::Mechanize for this or is it better to use the HTML::TableExtract and cycle through the Next link? Which is more efficient?
Depends.

If you can get to what you want by simply manipulating the URL (without any form submissions), then there is no need to involve WWW::Mechanize (LWP::Simple ought to do).
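As a rough sketch of the URL-manipulation approach: the "page" query parameter and both helper names below are assumptions for illustration, so check how the real site numbers its pages. Only LWP::Simple's get() is a real library call here.

```perl
use strict;
use warnings;

# Hypothetical helper: build the URL of result page $page.
# Assumes the site pages its results with a "page" query parameter.
sub page_url {
    my ($base, $page) = @_;
    return sprintf '%s?page=%d', $base, $page;
}

# Fetch pages 1..$last_page, stopping at the first failed request.
sub fetch_pages {
    my ($base, $last_page) = @_;
    require LWP::Simple;    # loaded at run time, only when fetching
    my @pages;
    for my $n (1 .. $last_page) {
        my $html = LWP::Simple::get(page_url($base, $n));
        last unless defined $html;    # get() returns undef on failure
        push @pages, $html;
    }
    return @pages;
}
```

If the "Next" link cannot be predicted from the URL alone, this shortcut does not apply and you are back to parsing each page for the link.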

If you only want the links, use HTML::LinkExtor. If you also want the associated link text, use HTML::LinkExtractor. For a link to that module's CPAN search page, for example, the link would be "http://search.cpan.org/search?mode=module&query=HTML%3A%3ALinkExtractor" and the text "HTML::LinkExtractor".
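For the links-only case, something along these lines should do. It is a minimal sketch: extract_links is a made-up helper name, and it only looks at the href of a tags.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical helper: return the href of every <a> tag in $html,
# made absolute against $base.
sub extract_links {
    my ($html, $base) = @_;
    my $parser = HTML::LinkExtor->new(undef, $base);    # no callback: links are collected
    $parser->parse($html);
    $parser->eof;
    my @urls;
    for my $link ($parser->links) {
        my ($tag, %attr) = @$link;    # each entry is [ $tag, attr => url, ... ]
        next unless $tag eq 'a' && defined $attr{href};
        push @urls, "$attr{href}";    # stringify the URI object
    }
    return @urls;
}
```

Note that HTML::LinkExtor also reports img, link, and other tags, which is why the loop filters on the tag name.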

If you want to extract more information, HTML::TableExtract might be appropriate (although I prefer HTML::TokeParser::Simple ).

3. I'm thinking of creating a weekly directory so I don't have all the links in one large directory
That sounds reasonable (whatever floats your boat).
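One way to name the weekly directories, as a sketch: the "YYYY-weekNN" scheme and the weekly_dir helper are just one possible choice, with the week counted in simple 7-day blocks from January 1 rather than ISO weeks.

```perl
use strict;
use warnings;
use File::Path qw(make_path);
use File::Spec;

# Hypothetical helper: directory name for the week containing $time
# (defaults to now), e.g. "articles/2024-week03" on Unix.
sub weekly_dir {
    my ($root, $time) = @_;
    $time = time unless defined $time;
    my @t    = localtime $time;
    my $week = int($t[7] / 7) + 1;    # $t[7] is the 0-based day of the year
    return File::Spec->catdir($root, sprintf('%04d-week%02d', $t[5] + 1900, $week));
}
```

Before saving an article you would call make_path(weekly_dir('articles')), which creates the directory if it is missing and does nothing if it already exists.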

merlyn has written a few articles on the subject, so you might wanna check them out at http://www.stonehenge.com/merlyn/.




In reply to Re: Designing a web scrapper by PodMaster
in thread Designing a web scrapper by Anonymous Monk
