If you want a nice, complete discussion of writing spiders and parsing HTML, you may want to look at the new O'Reilly tome Perl and LWP. It includes many examples of mining information from websites, ranging from using a few regexps to pull out the information, to rebuilding the HTML in tree form and throwing it out again, to spidering entire sites in the correct manner.

I recently had to write a spider for work, and whilst I'd got it working and doing what we needed, this book pointed out a few things I'd overlooked, allowing me to tighten things up and cut down the chances of it falling to pieces. Well recommended.


In reply to Re: Weather goest thou, spider? by Molt
in thread Weather goest thou, spider? by jens
