I'm building a complex web spider that fetches the content its links point to and builds a binary representation of that data for display on a Palm handheld device.

During spidering, I go through the links found and try to skip any that happen to be "duplicate" links: things like www.slashdot.org, which redirects to slashdot.org (a 302), or www.cnn.com, which serves the same content as cnn.com.

What's the easiest way to programmatically determine whether a link is a duplicate of one already stored in the hash of links extracted from the page, without incurring a hit to the site itself to CRC the content or send a HEAD request (both of which have their own design flaws)? I don't want to retrieve the same content twice if one page links to www.foo.com and another page in the same session links to foo.com.

Is this possible? Some magic foo with the URI module? I'm already using URI to validate that the URL is properly formatted (and not file://etc/foo or ftp://foo and so on), but I'd like to eliminate any dupes at link extraction time, even if that takes a HEAD request, before I spider them with GET (though I'd prefer to avoid the double hit of HEAD then GET on the same links).

Note that www.foo.com and foo.com may not be the same content, so I can't just strip the 'www.' from the front of "similar" URLs with a regex.
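
For context, here is a minimal sketch of the offline part of this: validating the scheme with URI and keying the hash of seen links on the canonical form, so that trivial spelling variants (case, default port, empty path, fragment) collapse without any network hit. The helper name link_key, the base URL, and the sample links are hypothetical and only for illustration. It deliberately leaves www.foo.com and foo.com as separate entries, since they may not be the same content.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: return a canonical string key for an http/https
# link, or nothing if the link should be skipped (file:, ftp:, etc.).
sub link_key {
    my ($link, $base) = @_;

    # Resolve relative links against the page URL, then canonicalize
    # (lowercases scheme and host, drops the default port).
    my $uri    = URI->new_abs($link, $base)->canonical;
    my $scheme = $uri->scheme;
    return unless defined $scheme && $scheme =~ /^https?$/;

    # The fragment is never sent to the server, so ignore it.
    $uri->fragment(undef);
    return $uri->as_string;
}

# Sample data (assumed, not from a real crawl).
my $base  = 'http://Example.com/index.html';
my @links = (
    'HTTP://WWW.Foo.com:80',
    'http://www.foo.com/',
    'http://www.foo.com/#top',
    'http://foo.com/',
    'ftp://foo.com/pub',
    '/about',
);

my (%seen, @queue);
for my $link (@links) {
    my $key = link_key($link, $base);
    next unless defined $key;    # unsupported scheme
    next if $seen{$key}++;       # already queued in this session
    push @queue, $key;
}

print "$_\n" for @queue;
```

This only removes duplicates that differ in spelling. Deciding whether two different hosts serve the same content still needs information from the server, which this sketch does not attempt.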

Has anyone done anything like this before?


In reply to Eliminating "duplicate" domains from a hash/array by hacker
