I worked for a search engine (the late, lamented Northern Light) and you've just triggered a few pet peeves.

First, if you're going to pay attention to meta tag keywords, you'd better be crawling only your own site. In general, meta information is not just useless, it's actively deceptive. Trusting it is generally worse than a waste of time. You'll end up with a database full of lies. Sad, but true.

Second, if you're crawling pages that aren't yours, you'd better obey the robot rules. Use LWP::RobotUA here, with a reasonable delay between requests. The default one-minute delay is fine; dropping it as low as 10 or 20 seconds between requests is probably OK too. (I'd leave it at the one-minute delay, personally.)
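
For what it's worth, here's a minimal sketch of that setup. The agent name, the contact address, and the fetch_politely helper are placeholders I made up for illustration; substitute your own. Note that delay() takes minutes, not seconds.

```perl
use strict;
use warnings;
use LWP::RobotUA;

# Placeholder agent name and contact address; use ones that identify you.
my $ua = LWP::RobotUA->new('my-crawler/0.1', 'me@example.com');

# delay() is measured in minutes. 1 is the default, set here only to be explicit.
# Something like 20/60 would give roughly 20 seconds between requests.
$ua->delay(1);

# Hypothetical helper: fetch one URL through the robot-aware agent.
# LWP::RobotUA reads the site's robots.txt before fetching, and answers
# with a 403 response for anything the rules disallow.
sub fetch_politely {
    my ($url) = @_;
    my $res = $ua->get($url);
    return $res->is_success ? $res->decoded_content : undef;
}
```

By default the agent sleeps until the delay for that host has passed, so a plain loop over your URL list calling fetch_politely is already throttled.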

If you're not going to use LWP::RobotUA, and are crawling other people's pages, then you'd darned well better make sure you space out the requests. (Your ISP may want you to do this for your own pages, too: snagging a couple of thousand pages over a cable modem or other reasonably high-bandwidth connection can be pretty harsh.)
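
If you do roll your own, the spacing itself is simple enough. Here's a rough sketch; crawl_spaced and its arguments are names I invented for illustration, and the fetch callback is whatever you'd normally use to get a page. It only handles the pacing. It does not read robots.txt, so you'd still have to check that yourself (WWW::RobotRules can parse the file for you).

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep time);

# Call $fetch->($url) for each URL in turn, waiting so that at least
# $delay seconds pass between the start of one request and the next.
sub crawl_spaced {
    my ($urls, $delay, $fetch) = @_;
    my $last;       # start time of the previous request, undef before the first
    my @results;
    for my $url (@$urls) {
        if (defined $last) {
            my $wait = $delay - (time() - $last);
            sleep($wait) if $wait > 0;
        }
        $last = time();
        push @results, $fetch->($url);
    }
    return @results;
}
```

With a plain LWP::UserAgent in $ua, something like crawl_spaced(\@urls, 60, sub { $ua->get($_[0]) }) gives you one request a minute.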

If you're going to do it, then do it right, be polite, and respect the "Keep off the Grass" signs.


In reply to Re: Site Crawler by Elian
in thread Site Crawler by Frisbeeman
