in reply to Re: Sorting URLs on domain/host: sortkeys generation
in thread Sorting URLs on domain/host: sortkeys generation

Yes, I would have been satisfied with a simple sort if the domain/host names were of the form scheme://word1.word2/?(blah)?. In reality, there are often more than two words, as in www.google.com or www.pacareerlink.state.pa.us. Hence the need for non-trivial sorting.

Thanks for the tip for sorting by IP addresses though.

Re: Re: Re: Sorting URLs on domain/host: sortkeys generation
by tachyon (Chancellor) on Mar 30, 2003 at 09:40 UTC

    Actually it does not matter that there are \w\.\w... sequences: because . sorts before any \w character, you get the desired result. The http:// is also immaterial provided all entries either have it or all lack it. cmp sorting does not stop at the first non-word character; it simply sorts in ASCII order.

    print "$_\n" for sort qw(
        http://.
        http://www.google.com
        http://www.google.co.uk
        http://au.google.com
        http://au.goo.com
        http://au.goop.com
    );

    __DATA__
    http://.
    http://au.goo.com
    http://au.google.com
    http://au.goop.com
    http://www.google.co.uk
    http://www.google.com

    This looks appropriately sorted to me. The IP code will get you the domain (or IP) in $1 regardless, so you can easily modify it; but as this shows, you don't really need to unless you want to trim off the ftp://, http://, or https:// part and thus lump these into one group. The only other modification you can make to the domain name is to chop the www. off (trying to guess other subdomains is a hopeless task). Otherwise the default cmp should work fine. Perhaps you could post an example where it does not?
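
A minimal sketch of the trimming described above, assuming every entry has the shape scheme://host/... The helper name host_key and the sample list are made up for illustration; this is not the IP code referred to in the post.

```perl
use strict;
use warnings;

# Hypothetical helper: build a comparison key by dropping the scheme
# and a leading "www." so that http://, https:// and ftp:// entries
# for the same host end up next to each other.
sub host_key {
    my ($url) = @_;
    (my $key = $url) =~ s{^[a-z][a-z0-9+.-]*://}{}i;   # drop scheme://
    $key =~ s{^www\.}{}i;                              # drop leading www.
    return $key;
}

my @urls = qw(
    http://www.google.com/
    ftp://au.google.com/
    https://google.com/
    http://au.goo.com/
);

# Sort on the trimmed key; fall back to the full URL to break ties.
my @sorted = sort { host_key($a) cmp host_key($b) or $a cmp $b } @urls;
print "$_\n" for @sorted;
```

With this sample list the two google.com entries land side by side despite their different schemes: http://au.goo.com/, ftp://au.google.com/, http://www.google.com/, https://google.com/. The key is recomputed on every comparison, which is fine for short lists only.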

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      I'm quite sure the OP wanted something like:

      http://.
      http://au.goo.com
      http://au.google.com
      http://www.google.com
      http://au.goop.com
      http://www.google.co.uk

      That is, sorted by TLD, then 2nd level domain, then 3rd, and so on. At least, it's what his code does (notice the reverse).

      -- 
              dakkar - Mobilis in mobile
      
      http://aset.its.psu.edu/announcements/newsgroup_changes.html
      http://aset.its.psu.edu/unix_group/
      http://aset.its.psu.edu/unix_group/unixaccounts.html
      http://aset.psu.edu/ait/
      http://aset.psu.edu/ait/filesys.html
      http://aset.psu.edu/unix_group/lsfaqs.html
      http://aset.psu.edu/unix_group/quickunix.html
      http://cac.psu.edu/
      http://cac.psu.edu/publish/htpasswd/alternate.html
      http://clc.its.psu.edu/
      http://clc.its.psu.edu/Labs/
      http://clc.its.psu.edu/Labs/Mac/
      http://clc.its.psu.edu/labs/Mac/software/all.aspx
      http://clc.its.psu.edu/labs/Mac/software/default.aspx
      http://css.its.psu.edu/internet/
      http://css.its.psu.edu/internet/unix.html
      http://css.its.psu.edu/news/alerts/
      http://css.its.psu.edu/news/alerts/K4notice.html
      http://its.psu.edu/
      http://its.psu.edu/computing.html
      http://its.psu.edu/learning.html
      http://search.psu.edu/query.html
      

      ...is sorted by the current algorithm (in the OP) into the following desired order...

      http://aset.psu.edu/ait/
      http://aset.psu.edu/ait/filesys.html
      http://aset.psu.edu/unix_group/lsfaqs.html
      http://aset.psu.edu/unix_group/quickunix.html
      http://cac.psu.edu/
      http://cac.psu.edu/publish/htpasswd/alternate.html
      http://aset.its.psu.edu/announcements/newsgroup_changes.html
      http://aset.its.psu.edu/unix_group/
      http://aset.its.psu.edu/unix_group/unixaccounts.html
      http://clc.its.psu.edu/
      http://clc.its.psu.edu/Labs/
      http://clc.its.psu.edu/Labs/Mac/
      http://clc.its.psu.edu/labs/Mac/software/all.aspx
      http://clc.its.psu.edu/labs/Mac/software/default.aspx
      http://css.its.psu.edu/internet/
      http://css.its.psu.edu/internet/unix.html
      http://css.its.psu.edu/news/alerts/
      http://css.its.psu.edu/news/alerts/K4notice.html
      http://its.psu.edu/
      http://its.psu.edu/computing.html
      http://its.psu.edu/learning.html
      http://search.psu.edu/query.html
      

      ...sorting is done first on the 2nd-level domain and TLD, then on the hostname if any, then on the remaining string if any. (I thought I had already written that in the OP; perhaps it was not clear...)

      Lest we forget the question: is there a less verbose way (than the one in the OP) to sort the URLs on the criteria just presented?
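
One possible shorter approach, offered as a sketch and not as the OP's code: reverse the dot-separated host labels, append the rest of the URL, and sort on that string with cmp. Because . sorts before / in ASCII, a key such as edu.psu.its.clc/ sorts ahead of edu.psu.its/, which matches the position of http://its.psu.edu/ in the desired list above. The names rev_host_key and sort_by_rev_host are hypothetical, the scheme is ignored in the key, and every URL is assumed to look like scheme://host/...

```perl
use strict;
use warnings;

# Hypothetical helper (not the OP's code): reverse the dot-separated
# host labels and keep the rest of the URL, e.g.
#   http://clc.its.psu.edu/Labs/  ->  edu.psu.its.clc/Labs/
sub rev_host_key {
    my ($url) = @_;
    my ($host, $rest) = $url =~ m{^[a-z][a-z0-9+.-]*://([^/]*)(.*)}is
        or return $url;    # no scheme://host part: use the URL itself
    return join('.', reverse split /\./, $host) . $rest;
}

# Schwartzian transform: compute each key once, sort on it, drop it.
# Ties on the key are broken by the full URL.
sub sort_by_rev_host {
    return map  { $_->[1] }
           sort { $a->[0] cmp $b->[0] or $a->[1] cmp $b->[1] }
           map  { [ rev_host_key($_), $_ ] } @_;
}

# A few of the URLs from the list above, deliberately shuffled.
my @urls = qw(
    http://its.psu.edu/
    http://aset.its.psu.edu/unix_group/
    http://search.psu.edu/query.html
    http://cac.psu.edu/
    http://aset.psu.edu/ait/
    http://css.its.psu.edu/internet/
);

print "$_\n" for sort_by_rev_host(@urls);
```

For this subset the output order is aset.psu.edu, cac.psu.edu, aset.its.psu.edu, css.its.psu.edu, its.psu.edu, search.psu.edu, the same relative order as in the desired list.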

      (Long) Side note: FWIW, I converted the given Schwartzian transform to a Guttman-Rosler Transform as an exercise; it was around 14-16% faster (benchmarked with Perl 5.8, merge and quick sorts, on FreeBSD 4.7/386). That is not much of a difference to me in this case, unless I am missing something.
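
For readers unfamiliar with the Guttman-Rosler Transform, here is a rough sketch of the idea: the sort key and the original URL are packed into one string so that the default sort (no comparison block) can be used, and the URL is recovered afterwards. The key function, the NUL separator, and the name grt_sort are assumptions made for this illustration, not the code that was benchmarked; it also assumes no URL contains a NUL byte.

```perl
use strict;
use warnings;

# Hypothetical key: reversed host labels followed by the rest of the URL,
# e.g. http://its.psu.edu/learning.html -> edu.psu.its/learning.html
sub rev_host_key {
    my ($url) = @_;
    my ($host, $rest) = $url =~ m{^[a-z][a-z0-9+.-]*://([^/]*)(.*)}is
        or return $url;    # no scheme://host part: use the URL itself
    return join('.', reverse split /\./, $host) . $rest;
}

# Guttman-Rosler style: key . NUL . payload, sorted as plain strings.
sub grt_sort {
    return map  { substr($_, index($_, "\0") + 1) }   # keep the URL after the NUL
           sort                                        # default string sort
           map  { join "\0", rev_host_key($_), $_ } @_;
}

my @urls = qw(
    http://its.psu.edu/
    http://css.its.psu.edu/internet/
    http://cac.psu.edu/
);

print "$_\n" for grt_sort(@urls);
```

NUL sorts before every other character, so a key that is a prefix of another key still comes first, just as it would under cmp.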

      I beg to differ: simply sorting URLs will put http://www.google.com/stuff nowhere near http://google.com/stuff or http://translate.google.com/stuff.