•Re: Eliminating "duplicate" domains from a hash/array

You can't do it precisely programmatically. You have to determine that two pages are "close enough" when you hit them. I know Google knows to do that, but I've run into other web walkers that don't.
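
As a rough illustration of a "close enough" check at fetch time, here is a minimal sketch in Python. It fingerprints each page body after collapsing whitespace and case, and reports when the same fingerprint has already been seen under another URL. The names fingerprint and SeenPages are made up for this example, and exact matching after normalization is only a crude stand-in for fuzzier similarity tests; nothing here is meant to describe what Google actually does.

```python
import hashlib
import re


def fingerprint(body: str) -> str:
    """Hash a page body after collapsing whitespace and lowercasing it."""
    normalized = re.sub(r"\s+", " ", body).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SeenPages:
    """Remembers which URL first produced each content fingerprint."""

    def __init__(self):
        self._first_url = {}

    def check(self, url: str, body: str):
        """Return the URL first seen with this content, or None if it is new."""
        key = fingerprint(body)
        if key in self._first_url:
            return self._first_url[key]
        self._first_url[key] = url
        return None


seen = SeenPages()
seen.check("http://example.com/", "<h1>Hi</h1>\n")             # new page
dup_of = seen.check("http://example.com/-/-/", "<h1>Hi</h1>")  # same content
```

A walker that gets a non-None answer back can skip indexing the page and stop following its links, which is enough to break out of a /-/-/-/ style loop.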

I found that out by putting a link to "-" on my webserver root page, and symlinking "-" to "." in my root doc directory. So any page on my website was reachable with any number of /-/-/-/-/- prefix segments before the real path. Google immediately figured it out, but other webcrawlers visited (and indexed!) my entire web site some 15 or 20 levels deep before giving up.

If you are spidering your own site, you can add code in your spider to canonicalize your URLs before fetching. I did that in a few of my columns.
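
A minimal sketch of that kind of canonicalization, in Python, assuming you already know which path segments on your own site are aliases for "." (here only "-", as in the symlink trick above). The function name canonicalize and the ALIAS_SEGMENTS set are invented for this example; this is not the code from the columns, and it would be wrong for a site that has a real file or directory named "-".

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

# Assumption: path segments known to be aliases for "." on this particular site.
ALIAS_SEGMENTS = {"-"}
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url: str, alias_segments=ALIAS_SEGMENTS) -> str:
    """Reduce a URL on our own site to one preferred spelling."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()

    # Keep the port only when it is not the default for the scheme.
    # Any userinfo in the URL is deliberately dropped.
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"

    # Drop empty and alias segments, then resolve "." and "..".
    segments = [s for s in parts.path.split("/") if s and s not in alias_segments]
    path = posixpath.normpath("/" + "/".join(segments))

    # normpath strips a trailing slash; put it back for directory URLs.
    if parts.path.endswith("/") and path != "/":
        path += "/"

    # Keep the query string, drop the fragment (it never reaches the server).
    return urlunsplit((scheme, netloc, path, parts.query, ""))


print(canonicalize("HTTP://Example.COM:80/-/-/-/index.html#top"))
# prints http://example.com/index.html
```

The spider would call this on every link before checking its "already fetched" table, so that all the /-/ spellings of a page collapse to one key.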

-- Randal L. Schwartz, Perl hacker
Be sure to read my standard disclaimer if this is a reply.

Re: •Re: Eliminating "duplicate" domains from a hash/array by pg (Canon) on Mar 31, 2003 at 02:59 UTC
For the case of symlinking '-' to '.', that is obviously a kind of problem that can be resolved precisely and programmatically. The fact that Google can resolve it clearly shows that it is resolvable; the fact that others cannot do the same only means their programs are not smart enough. We have to clearly identify what is logically doable and what is not. Something that nobody handles, or that somebody handles badly, is not necessarily logically unresolvable. The actual difficulty of comparing URLs has really nothing to do with this kind of small trick, which is obviously logically and programmatically resolvable. The real problem is that the solution to this kind of issue depends largely on the internal structure of each particular site, which is not regulated by any standard and can differ greatly from site to site. We have to remember that no search engine is just a set of programs; it is a set of programs plus MANUALLY MAINTAINED INFORMATION. Without that MANUALLY MAINTAINED INFORMATION, there is no Google or any other search engine.
Re: •Re: Eliminating "duplicate" domains from a hash/array by bsb (Priest) on Mar 31, 2003 at 11:21 UTC
I'm really curious, why were you doing this? Brad
•Re: Re: •Re: Eliminating "duplicate" domains from a hash/array by merlyn (Sage) on Mar 31, 2003 at 15:41 UTC
My usual impish self wanted to see how deep the indexers would go, and to see if it would cause my site to be multiply indexed.

-- Randal L. Schwartz, Perl hacker
Be sure to read my standard disclaimer if this is a reply.