but of course my test website had to come back with an error

One tip for developing scrapers: save a local copy of the pages and develop against that. It's convenient for you and polite to the site you're scraping, because you can hammer the copy all you want without bothering their server. If you're scraping a lot of pages and doing a lot of tweaking on your code, you can easily put a real load on someone's server. Once your extractor works, you can put back the Mechanize calls to the site, which are probably not the hard part.

In the example I gave upthread, it would have been OK for me to hammer the site, but I ended up cloning it with wget and running against the local copy.
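If you'd rather not mirror the whole site, one way to get the same effect is to cache each page on disk the first time you fetch it. Here's a minimal sketch, not the code I actually used: fetch_cached, the cache directory, and the fetcher code ref are all names I made up for illustration. It only uses core modules, and the real network call is passed in, so you can plug in Mechanize or anything else.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use File::Path qw(make_path);
use File::Spec;

# Return the HTML for $url. Read it from $cache_dir when a saved copy
# exists; call $fetch->($url) only on a cache miss and save the result.
# $fetch is any code ref that returns the page body as a string
# (hypothetical interface, chosen for this sketch).
sub fetch_cached {
    my ($url, $cache_dir, $fetch) = @_;

    make_path($cache_dir) unless -d $cache_dir;

    # Hash the URL so that any URL maps to a safe file name.
    my $file = File::Spec->catfile($cache_dir, md5_hex($url) . '.html');

    if (-e $file) {
        open my $in, '<:encoding(UTF-8)', $file
            or die "Cannot read $file: $!";
        local $/;    # slurp the whole file
        my $html = <$in>;
        close $in;
        return $html;
    }

    # Cache miss: hit the server once, then keep the copy.
    my $html = $fetch->($url);
    open my $out, '>:encoding(UTF-8)', $file
        or die "Cannot write $file: $!";
    print {$out} $html;
    close $out;
    return $html;
}

# Usage with WWW::Mechanize (not loaded here, so left as a comment):
#   my $mech = WWW::Mechanize->new;
#   my $html = fetch_cached($url, 'cache',
#       sub { $mech->get($_[0]); $mech->content });
```

While you're tweaking the extractor, every run after the first reads from disk. Delete the cache directory when you want fresh pages.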

Update: You might also want to see if the site you're scraping has an API that hands you structured data. I recently had to pull down the links for about 140 books from the Apple site, and they have a nice API that lets you search by ISBN. Amazon also tends to have an API for a lot of things. Other sites often do as well if you dig through the fine print at the bottom of the page.
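To give an idea of what the API route looks like, here's a rough sketch. Assumptions, since I didn't say above which API I used: it targets Apple's public iTunes Search API lookup endpoint, and it assumes each result carries its store link in a trackViewUrl field. The function names are made up for illustration, and fetching is left out so the parsing can be tried on a saved response.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Build a lookup URL for one ISBN.
# ASSUMPTION: the iTunes Search API lookup endpoint; check the API
# documentation for the site you are actually working with.
sub lookup_url {
    my ($isbn) = @_;
    $isbn =~ s/[^0-9Xx]//g;    # drop hyphens and spaces
    return "https://itunes.apple.com/lookup?isbn=$isbn";
}

# Extract the store link from each result in a JSON response body.
# ASSUMPTION: links live in a "trackViewUrl" field on each result.
sub links_from_response {
    my ($json_text) = @_;
    my $data = decode_json($json_text);
    return map  { $_->{trackViewUrl} }
           grep { defined $_->{trackViewUrl} }
           @{ $data->{results} || [] };
}
```

Compared with scraping, there's no HTML to parse and nothing breaks when the page layout changes. You still want to keep the request rate reasonable.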


In reply to Re^3: extracting data from HTML by bitingduck
in thread extracting data from HTML by Jurassic Monk
