PerlMonks  

Re: Normalizing URLs

by derby (Abbot)
on Jul 21, 2005 at 15:00 UTC ( [id://476867]=note )


in reply to Normalizing URLs

I haven't tried it, but wouldn't URI and its canonical and eq methods work for you?

Update: Looks like URI will not normalize query params. Something like this should work (note, I did not check all cases - feel free to fix!)

#!/usr/bin/perl -w
use strict;
use URI;

my $u1 = URI->new("http://www.perl.com/cgi-bin/script.cgi?a=b&c=d");
my $u2 = URI->new("http://www.perl.com/cgi-bin/script.cgi?c=d&a=b");

my $u1c = $u1->canonical;
my $u2c = $u2->canonical;

if( urlsEqual( $u1c, $u2c ) ) {
  print "equal\n";
} else {
  print "not equal\n";
}

sub urlsEqual {
  my( $u1, $u2 ) = @_;
  my( $q1, $q2 );

  # First try URI eq
  return 1 if( $u1->eq( $u2 ) );

  # nope ... adjust query
  $q1 = $u1->query();
  $q2 = $u2->query();

  # Sort the query segments so parameter order no longer matters
  $q1 = join( '&', sort( split( /[&;]/, $q1 ) ) ) if $q1;
  $q2 = join( '&', sort( split( /[&;]/, $q2 ) ) ) if $q2;

  # Note: this rewrites the query of the URI objects passed in
  $u1->query( $q1 );
  $u2->query( $q2 );

  return $u1->eq( $u2 );
}

-derby

Replies are listed 'Best First'.
Re^2: Normalizing URLs
by ikegami (Patriarch) on Jul 21, 2005 at 15:45 UTC

    From what I saw, URI

    • Lowercases the scheme.
    • Lowercases the domain name. (1)
    • Removes the port if it's the default. (2)
    • Removes port fields consisting of just ':'. (3)
    • Adds trailing '/' if no path or query is specified. (6, partial)

    • Doesn't do (4), (5), (7) and (8), but easy to do.
    • Doesn't do (9) and (10), but might not be possible.
    • Doesn't set the path to '/' if no path is specified and a query is specified. (6, partial)
    • Doesn't normalize IP addresses into dotted form.
    • Doesn't remove the trailing '.' from domain names, if any.
    • Doesn't touch the query.
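Two of the gaps listed above (the trailing '.' on the domain name, and the missing '/' path when only a query is given) could be layered on top of canonical roughly as follows. This is only a sketch: normalize_more is a hypothetical helper name, not part of URI, and it assumes server-style URLs such as http.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper (not part of URI): canonical() plus two extra steps.
sub normalize_more {
    my ($str) = @_;

    # clone so we never modify an object that canonical() may share
    my $u = URI->new($str)->canonical->clone;

    # Assumption: only server-style URIs (http, https, ftp, ...) are handled.
    return $u->as_string unless $u->can('host') && defined $u->host;

    # Drop a single trailing '.' from the domain name.
    ( my $host = $u->host ) =~ s/\.\z//;
    $u->host($host);

    # Empty path but a query present: set the path to '/'.
    $u->path('/') if $u->path eq '' && defined $u->query;

    return $u->as_string;
}

print normalize_more('HTTP://WWW.Perl.COM.:80?a=b'), "\n";
```

The scheme, case and default-port handling is left entirely to canonical; the helper only adds the two steps it does not perform.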
Re^2: Normalizing URLs
by Anonymous Monk on Jul 22, 2005 at 12:30 UTC
    You can't expect a module called 'URI' to normalize CGI parameters. http://foo.com/bar?a=b&c=d and http://foo.com/bar?c=d&a=b are two different URIs. The fact that the receiving server treats the two different URIs the same is outside the URI realm.
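To make that distinction concrete, here is a small sketch: eq reports the two URIs as different, while an application that decides parameter order is irrelevant can compare the decoded key/value pairs itself. sorted_params is a hypothetical name, and "order does not matter" is an assumption about the receiving server, not something URI knows.

```perl
use strict;
use warnings;
use URI;

my $x = URI->new('http://foo.com/bar?a=b&c=d');
my $y = URI->new('http://foo.com/bar?c=d&a=b');

# As URIs they differ: eq() compares the canonical strings.
print "URI eq: ", ( $x->eq($y) ? "same" : "different" ), "\n";

# Hypothetical application-level policy: parameter order does not matter.
sub sorted_params {
    my ($u) = @_;
    my @kv = $u->query_form;    # flat list: key, value, key, value, ...
    my @pairs;
    while ( my ( $k, $v ) = splice( @kv, 0, 2 ) ) {
        # Simplification: joining with '=' is ambiguous if a key contains '='
        push @pairs, "$k=" . ( defined $v ? $v : '' );
    }
    return join '&', sort @pairs;
}

print "params: ",
    ( sorted_params($x) eq sorted_params($y) ? "same" : "different" ), "\n";
```

Neither URI object is modified here; the comparison lives in the application, which is where the knowledge about the server's behaviour belongs.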
