in reply to Normalizing URLs

Keep in mind the point of the article: it's impossible to do URL normalization well. They've even missed two items that need normalizing (though they might be in the linked spec):

1) The order of the query-string arguments in a GET request:
.../script.cgi?a=b&c=d vs
.../script.cgi?c=d&a=b

2) The domain name:
example.com vs
example.com. vs
EXAMPLE.COM

Oh and IP addresses too:
10.0.0.1 vs
0x0A000001 vs
167772161
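The three cases above can be sketched in code. This is a rough illustration in Python (not a complete normalizer, and the function name is mine): sort the query arguments, lowercase the host and drop a trailing root dot, and fold hex/decimal IP literals into dotted-quad form. It ignores userinfo and ports for brevity.

```python
import ipaddress
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize(url):
    parts = urlsplit(url)
    # 1) sort query arguments so a=b&c=d and c=d&a=b compare equal
    query = urlencode(sorted(parse_qsl(parts.query)))
    # 2) lowercase the host and strip a trailing root dot
    host = (parts.hostname or "").lower().rstrip(".")
    # 3) fold decimal/hex integer hosts into dotted-quad IPs
    for base in (10, 16):
        try:
            host = str(ipaddress.IPv4Address(int(host, base)))
            break
        except ValueError:
            pass  # not an integer in this base; leave the host alone
    return urlunsplit((parts.scheme.lower(), host, parts.path,
                       query, parts.fragment))
```

With this, all of the variants listed above compare equal, e.g. `normalize("http://EXAMPLE.COM./script.cgi?c=d&a=b")` matches `normalize("http://example.com/script.cgi?a=b&c=d")`, and `http://0x0A000001/` matches `http://10.0.0.1/`.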

Re^2: Normalizing URLs
by Anonymous Monk on Jul 21, 2005 at 14:46 UTC
    Some background - I am developing a scraper and it needs to know whether it has already scraped a page - hence the normalization. It needn't be perfect, just good enough for all but the most arcane cases. It only needs to work with HTTP.

      What about using the "last_modified" method in LWP? Keep track of it locally. When you access the page again, check the time it was modified and skip it if that time is not newer than what you've saved.

      This idea is from "Spidering Hacks" (hack #16).
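      The bookkeeping that suggestion describes is simple; here's the idea sketched in Python (the original suggestion is Perl/LWP, and the names here are illustrative, not from either book or module). `cache` maps URL to the last Last-Modified value you saved:

      ```python
      def should_skip(cache, url, last_modified):
          """Return True if the page hasn't changed since we last scraped it."""
          prev = cache.get(url)
          if prev is not None and last_modified is not None and last_modified <= prev:
              return True  # not newer than what we've saved: skip it
          cache[url] = last_modified  # remember the newest timestamp we've seen
          return False
      ```

      Note that if the server doesn't send Last-Modified at all, this sketch falls back to re-fetching the page, which is the safe choice.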