in reply to Re: Eliminating "duplicate" domains from a hash/array
in thread Eliminating "duplicate" domains from a hash/array
Also, suppose my spider takes an hour to fetch a series of links from a site: the first link is freshmeat.net, and the last link fetched (an hour later) is www.freshmeat.net, which by then has a few new items added to the top of the page (as such sites always do). The content will differ, but there is no need to fetch it again, since the first fetch of this session already gave me "mostly" current content.
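For within-session duplicates of that kind, something like this sketch is what I have in mind (the canonical_host rule of stripping a leading "www." is an assumption of mine; real alias detection is messier, and fetch() here is just a stand-in):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use URI;

    # Hosts already fetched during this crawl session.
    my %seen_host;

    # Assumed rule: treat "www.example.com" and "example.com" as the
    # same host. Real-world aliasing is looser than this.
    sub canonical_host {
        my ($url) = @_;
        my $host = lc( URI->new($url)->host );
        $host =~ s/^www\.//;
        return $host;
    }

    # Stand-in for the real fetch routine.
    sub fetch { print "fetching $_[0]\n" }

    my @queue = ('http://freshmeat.net/', 'http://www.freshmeat.net/');
    for my $url (@queue) {
        next if $seen_host{ canonical_host($url) }++;  # "mostly" current already
        fetch($url);
    }

The second URL is skipped because both collapse to the same canonical host within the session.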
I realize HEAD information is also not the best approach, because Last-Modified headers are frequently absent, and dynamically generated pages can report a different Content-Length on every request even when nothing of substance has changed.
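For what it's worth, the HEAD check I'm rejecting would look roughly like this with LWP::UserAgent; the comments mark where it falls down:

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new( timeout => 15 );

    # Build a crude change fingerprint from header metadata alone.
    # Weakness: Last-Modified is frequently absent, and dynamic pages
    # can report a new Content-Length on every request even when the
    # substance hasn't changed.
    sub head_fingerprint {
        my ($url) = @_;
        my $res = $ua->head($url);
        return unless $res->is_success;
        return join '|',
            ( $res->header('Last-Modified')  || '' ),
            ( $res->header('Content-Length') || '' );
    }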
It's definitely a tricky subject, but I'm sure there are some ways to avoid most of these duplicate fetches.
One approach I thought of while sleeping last night: continuously maintain a small Berkeley DBM (or flat file) mapping hosts to the potential "duplicate" URIs they are known to answer for, keep it current on the client side, and consult it each time I start the spider up to crawl new content.
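A minimal sketch of that with DB_File (the file name and the seed entry are placeholders; discovering the alias pairs in the first place is the part this glosses over):

    use strict;
    use warnings;
    use Fcntl;
    use DB_File;

    # Persistent map on the client side: alias host => canonical host.
    # "spider_aliases.db" is a placeholder name.
    tie my %alias_of, 'DB_File', 'spider_aliases.db', O_CREAT|O_RDWR, 0644
        or die "Cannot tie alias db: $!";

    # Record an alias pair discovered during an earlier crawl.
    $alias_of{'www.freshmeat.net'} = 'freshmeat.net';

    # At spider startup, collapse each host to its canonical form so
    # known duplicates are skipped before any fetch happens.
    sub canonical {
        my ($host) = @_;
        return exists $alias_of{$host} ? $alias_of{$host} : $host;
    }

    print canonical('www.freshmeat.net'), "\n";  # prints "freshmeat.net"

    untie %alias_of;

Since the DBM persists between runs, aliases learned in one session keep paying off in every later crawl.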