Re: Normalizing URLs

by ikegami (Patriarch)
on Jul 21, 2005 at 14:43 UTC


in reply to Normalizing URLs

Keep in mind the point of the article: it's impossible to do URL normalization well. They've even missed two items that need normalizing, though they might be in the linked spec (a rough Perl sketch follows the list below):

1) The order of the arguments in a GET:
.../script.cgi?a=b&c=d vs
.../script.cgi?c=d&a=b

2) The domain name:
example.com vs
example.com. vs
EXAMPLE.COM

Oh and IP addresses too:
10.0.0.1 vs
0x0A000001 vs
167772161
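
For what it's worth, here is a rough Perl sketch of the first two cases using URI.pm; normalize_url() is just an illustrative name, it assumes absolute http URLs, and numeric host forms like 0x0A000001 are left alone.

use strict;
use warnings;
use URI;

sub normalize_url {
    my ($url) = @_;
    my $uri = URI->new($url)->canonical;   # lowercases scheme and host, drops the default port

    # example.com. -> example.com (trailing dot on the host)
    (my $host = $uri->host) =~ s/\.\z//;
    $uri->host($host);

    # make a=b&c=d and c=d&a=b compare equal by sorting the query keys
    # (this flattens repeated keys, which a real normalizer may care about)
    my %q = $uri->query_form;
    $uri->query_form(map { $_ => $q{$_} } sort keys %q) if %q;

    return $uri->as_string;
}

print normalize_url('HTTP://EXAMPLE.COM.:80/script.cgi?c=d&a=b'), "\n";
# prints: http://example.com/script.cgi?a=b&c=d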

Re^2: Normalizing URLs
by Anonymous Monk on Jul 21, 2005 at 14:46 UTC
    Some background: I'm developing a scraper and it needs to know whether it has already scraped a page, hence the normalization. It needn't be perfect, just good enough for all but the most arcane cases, and it only needs to work with HTTP.
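
    A minimal "have I scraped this already?" check, assuming a normalize_url() along the lines of the sketch above, could be a hash keyed on the normalized URL (a real scraper would probably persist %seen between runs):

    my %seen;

    sub seen_before {
        my ($url) = @_;
        return $seen{ normalize_url($url) }++;   # false the first time a page is seen
    }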

      What about using the "last_modified" method in LWP? Keep track of it locally. When you access the page again, check the time it was modified and skip it if that time is not newer than what you've saved.

      This idea is from "Spidering Hacks" (hack #16).
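
      A sketch of that idea, with the caveat that fetch_if_changed() and %last_seen are names of my own invention, and %last_seen would need to be saved between runs (e.g. with Storable):

      use strict;
      use warnings;
      use LWP::UserAgent;

      my $ua = LWP::UserAgent->new;
      my %last_seen;   # url => the Last-Modified time we saw (epoch seconds)

      sub fetch_if_changed {
          my ($url) = @_;
          my $res = $ua->get($url);
          return unless $res->is_success;

          my $mod = $res->last_modified;   # epoch seconds, or undef if the header is missing
          return if defined $mod
                 && defined $last_seen{$url}
                 && $mod <= $last_seen{$url};   # not newer than what we saved, so skip it

          $last_seen{$url} = $mod if defined $mod;
          return $res->decoded_content;
      }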
