in reply to Sorting URLs on domain/host: sortkeys generation

All you need is a basic cmp sort (for domain name based URLs):

print for sort <DATA>;
__DATA__
http://google.com
http://google.com/groups
http://google.com/groups/deeper
http://msn.com
http://msn.com/groups
http://msn.com/groups/deeper
http://apache.org
http://apache.org/docs
http://apache.org/docs/mod_perl

Which gives:

http://apache.org
http://apache.org/docs
http://apache.org/docs/mod_perl
http://google.com
http://google.com/groups
http://google.com/groups/deeper
http://msn.com
http://msn.com/groups
http://msn.com/groups/deeper

For numerical addresses you need to sort on 1) the integer representation of the 4-byte value that corresponds to the IP address, then 2) the rest of the URL (if any). This is a little more complex and uses a Schwartzian transform for efficiency. I have assumed dot quads. If you have to deal with other forms such as "127.1" and all the other types of valid IPs, load Socket and build the key with inet_aton instead: use Socket; my ($ip) = unpack "N", inet_aton($1); This will probably be a little slower than the raw unpack/pack/split presented below.

my @data = qw(
    http://3.3.3.3/docs/mod_perl
    http://3.3.3.3/docs
    http://3.3.3.3
    http://2.2.2.2
    http://10.1.1.1
    http://11.1.1.1
    http://2.2.2.2/groups
    http://2.2.2.2/groups/deeper
    http://1.1.1.1/groups/deeper
    http://1.1.1.1/groups
    http://1.1.1.1
    http://1.1.1.2
    http://1.1.2.1
);

# use Socket;    # only needed for the inet_aton variant

# Schwartzian transform: build [ url, ip, rest ] once per URL,
# sort numerically on ip, then as strings on rest, then unwrap.
my @sorted = map  { $_->[0] }
             sort { $a->[1] <=> $b->[1] || $a->[2] cmp $b->[2] }
             map  { munge_url($_) } @data;

print "$_\n" for @sorted;

sub munge_url {
    my $addr = $_[0];
    $addr =~ m!^(?:\w+://)?([^/]+)/?(.*)$!;
    # convert dot quad to a sortable integer
    my ($ip) = unpack 'N', pack 'C4', split '\.', $1;
    # or: my ($ip) = unpack 'N', inet_aton($1);
    my $rest = defined $2 ? $2 : '';
    # print "$ip $rest\n";    # debug: show the generated sort keys
    return [ $addr, $ip, $rest ];
}

__DATA__
http://1.1.1.1
http://1.1.1.1/groups
http://1.1.1.1/groups/deeper
http://1.1.1.2
http://1.1.2.1
http://2.2.2.2
http://2.2.2.2/groups
http://2.2.2.2/groups/deeper
http://3.3.3.3
http://3.3.3.3/docs
http://3.3.3.3/docs/mod_perl
http://10.1.1.1
http://11.1.1.1
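To make the two key-generation options mentioned above concrete, here is a minimal sketch that computes the integer key both ways. The helper names key_from_quad and key_from_inet_aton are made up for this illustration. For full dot quads the two should agree; how inet_aton treats short forms such as "127.1" depends on the platform, so that case is not shown.

```perl
use strict;
use warnings;
use Socket qw(inet_aton);

# Hypothetical helper: key via pack/split, valid only for full dot quads.
sub key_from_quad {
    my ($host) = @_;
    return unpack 'N', pack 'C4', split /\./, $host;
}

# Hypothetical helper: key via Socket's inet_aton; returns undef if the
# address cannot be converted.
sub key_from_inet_aton {
    my ($host) = @_;
    my $packed = inet_aton($host);
    return defined $packed ? unpack('N', $packed) : undef;
}

for my $host (qw(1.1.1.1 10.1.1.1 2.2.2.2)) {
    printf "%-10s %10u %10u\n",
        $host, key_from_quad($host), key_from_inet_aton($host);
}
```

Either helper could be dropped into munge_url in place of the inline unpack.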

There is no logical relation between fqdns and dot quad IPs (sort wise) until you resolve the IPs to fqdns.

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Re: Sorting URLs on domain/host: sortkeys generation
by parv (Parson) on Mar 30, 2003 at 08:40 UTC

    Yes, i would have been satisfied by a simple sort if the domain/host names were of the type scheme :// word 1 . word 2 /? (blah)?. In reality, there are often more than two words, say www.google.com or www.pacareerlink.state.pa.us. Thus the reason for non-trivial sorting.

    Thanks for the tip for sorting by IP addresses though.

      Actually it does not matter that there are \w\.\w... sequences: because . sorts before \w, you get the desired result. The http:// is also immaterial, provided either all entries have it or none do. cmp sorting does not stop at the first non-word character - it simply sorts in ASCII order.

      print "$_\n" for sort qw (
          http://.
          http://www.google.com
          http://www.google.co.uk
          http://au.google.com
          http://au.goo.com
          http://au.goop.com
      );
      __DATA__
      http://.
      http://au.goo.com
      http://au.google.com
      http://au.goop.com
      http://www.google.co.uk
      http://www.google.com

      This looks appropriately sorted to me. The IP code will get you the domain (or IP) in $1 regardless, so you can easily modify it, but as this shows you don't really need to unless you want to trim off the ftp:// http:// https:// part and thus lump these into one group. The only other modification you can make to the domain name is to chop the www. off (trying to guess other subdomains is a hopeless task). Otherwise the default cmp should work fine. Perhaps you could post an example of where it does not?
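As a rough illustration of the trimming described above, the following sketch sorts on a key with the scheme and a leading www. removed, and falls back to the full URL to break ties. The helper name host_key and the sample URLs are invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: sort key with the scheme and a leading "www." removed.
sub host_key {
    my ($url) = @_;
    (my $key = $url) =~ s{^\w+://}{};   # drop http://, https://, ftp://, ...
    $key =~ s{^www\.}{}i;               # drop a leading www. only
    return $key;
}

my @urls = qw(
    https://www.google.com/stuff
    http://google.com/stuff
    ftp://apache.org/dist
    http://www.apache.org/docs
);

# Schwartzian transform: compute each key once, compare keys,
# then compare the original URLs when the keys are equal.
my @sorted =
    map  { $_->[0] }
    sort { $a->[1] cmp $b->[1] or $a->[0] cmp $b->[0] }
    map  { [ $_, host_key($_) ] }
    @urls;

print "$_\n" for @sorted;
```

With this key, the www and non-www forms of a host end up next to each other whatever their scheme.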

      cheers

      tachyon

      s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

        I'm quite sure the OP wanted something like:

        http://.
        http://au.goo.com
        http://au.goop.com
        http://au.google.com
        http://www.google.com
        http://www.google.co.uk

        That is, sorted by TLD, then 2nd level domain, then 3rd, and so on. At least, it's what his code does (notice the reverse).

        -- 
                dakkar - Mobilis in mobile
        
        http://aset.its.psu.edu/announcements/newsgroup_changes.html
        http://aset.its.psu.edu/unix_group/
        http://aset.its.psu.edu/unix_group/unixaccounts.html
        http://aset.psu.edu/ait/
        http://aset.psu.edu/ait/filesys.html
        http://aset.psu.edu/unix_group/lsfaqs.html
        http://aset.psu.edu/unix_group/quickunix.html
        http://cac.psu.edu/
        http://cac.psu.edu/publish/htpasswd/alternate.html
        http://clc.its.psu.edu/
        http://clc.its.psu.edu/Labs/
        http://clc.its.psu.edu/Labs/Mac/
        http://clc.its.psu.edu/labs/Mac/software/all.aspx
        http://clc.its.psu.edu/labs/Mac/software/default.aspx
        http://css.its.psu.edu/internet/
        http://css.its.psu.edu/internet/unix.html
        http://css.its.psu.edu/news/alerts/
        http://css.its.psu.edu/news/alerts/K4notice.html
        http://its.psu.edu/
        http://its.psu.edu/computing.html
        http://its.psu.edu/learning.html
        http://search.psu.edu/query.html
        

        ...is sorted by current algorithm (in OP) in the following desired order...

        http://aset.psu.edu/ait/
        http://aset.psu.edu/ait/filesys.html
        http://aset.psu.edu/unix_group/lsfaqs.html
        http://aset.psu.edu/unix_group/quickunix.html
        http://cac.psu.edu/
        http://cac.psu.edu/publish/htpasswd/alternate.html
        http://aset.its.psu.edu/announcements/newsgroup_changes.html
        http://aset.its.psu.edu/unix_group/
        http://aset.its.psu.edu/unix_group/unixaccounts.html
        http://clc.its.psu.edu/
        http://clc.its.psu.edu/Labs/
        http://clc.its.psu.edu/Labs/Mac/
        http://clc.its.psu.edu/labs/Mac/software/all.aspx
        http://clc.its.psu.edu/labs/Mac/software/default.aspx
        http://css.its.psu.edu/internet/
        http://css.its.psu.edu/internet/unix.html
        http://css.its.psu.edu/news/alerts/
        http://css.its.psu.edu/news/alerts/K4notice.html
        http://its.psu.edu/
        http://its.psu.edu/computing.html
        http://its.psu.edu/learning.html
        http://search.psu.edu/query.html
        

        ...sorting is done first on the second-level domain plus TLD (here psu.edu), then on the hostname if any, then on the remaining string if any. (I thought i already wrote that in OP; perhaps it was not clear...)

        Lest we forget the question, is there a less verbose way (than the one in OP) to sort the URLs on criteria just presented above?
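One possibly shorter approach, offered as a sketch and not taken from the OP's code, is to reverse the dot-separated labels of the host and sort on that string followed by the path. The helper name reversed_host_key is made up here, and the sample is a subset of the list above.

```perl
use strict;
use warnings;

# Hypothetical helper: "http://aset.its.psu.edu/x" becomes "edu.psu.its.aset/x".
sub reversed_host_key {
    my ($url) = @_;
    my ($host, $path) = $url =~ m{^(?:\w+://)?([^/]+)(.*)$};
    return $url unless defined $host;   # leave unparsable input as its own key
    return join('.', reverse split /\./, $host) . $path;
}

my @urls = qw(
    http://its.psu.edu/
    http://css.its.psu.edu/internet/
    http://aset.its.psu.edu/unix_group/
    http://search.psu.edu/query.html
    http://cac.psu.edu/
    http://aset.psu.edu/ait/
);

# Schwartzian transform on the reversed-host key.
my @sorted =
    map  { $_->[0] }
    sort { $a->[1] cmp $b->[1] }
    map  { [ $_, reversed_host_key($_) ] }
    @urls;

print "$_\n" for @sorted;
```

On this sample the result follows the desired order shown above. Note that its.psu.edu lands after its own subdomains only because "." sorts before "/" in ASCII; a URL with no path at all (no trailing slash) would sort ahead of them, so this is an assumption about the input, not a general guarantee.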

        (Long) Side note: FWIW, i converted the given Schwartzian transform to a Guttman-Rosler Transform as an exercise, which was around 14-16% faster (benchmarked, Perl 5.8, merge/quick sorts, FreeBSD 4.7/386) -- not much of a difference (to me in this case, unless i am missing something).
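For readers unfamiliar with the technique, here is a minimal sketch of what a Guttman-Rosler Transform of the IP sort might look like. It is not the code that was benchmarked above. It assumes every URL has a full dot-quad host and the same scheme, so that comparing the whole URL after the packed key is equivalent to comparing the rest of the URL.

```perl
use strict;
use warnings;

my @data = qw(
    http://3.3.3.3/docs
    http://10.1.1.1
    http://2.2.2.2/groups
    http://1.1.2.1
    http://1.1.1.1/groups
    http://1.1.1.1
);

# Guttman-Rosler Transform: prepend a fixed-width packed key to each
# string, use the default string sort, then strip the key off again.
my @sorted =
    map  { substr $_, 4 }                      # remove the 4-byte key
    sort                                       # default sort, no comparison block
    map  {
        my ($host) = m{^(?:\w+://)?([^/]+)};
        pack('C4', split /\./, $host) . $_;    # 4 raw IP bytes, then the URL
    }
    @data;

print "$_\n" for @sorted;
```

The gain over the Schwartzian transform comes from sort running without a Perl-level comparison block.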

        I beg to differ: simply sorting URLs will put http://www.google.com/stuff nowhere near http://google.com/stuff or http://translate.google.com/stuff.