in reply to Re: Re: Re: Sorting URLs on domain/host: sortkeys generation
in thread Sorting URLs on domain/host: sortkeys generation
http://aset.its.psu.edu/announcements/newsgroup_changes.html
http://aset.its.psu.edu/unix_group/
http://aset.its.psu.edu/unix_group/unixaccounts.html
http://aset.psu.edu/ait/
http://aset.psu.edu/ait/filesys.html
http://aset.psu.edu/unix_group/lsfaqs.html
http://aset.psu.edu/unix_group/quickunix.html
http://cac.psu.edu/
http://cac.psu.edu/publish/htpasswd/alternate.html
http://clc.its.psu.edu/
http://clc.its.psu.edu/Labs/
http://clc.its.psu.edu/Labs/Mac/
http://clc.its.psu.edu/labs/Mac/software/all.aspx
http://clc.its.psu.edu/labs/Mac/software/default.aspx
http://css.its.psu.edu/internet/
http://css.its.psu.edu/internet/unix.html
http://css.its.psu.edu/news/alerts/
http://css.its.psu.edu/news/alerts/K4notice.html
http://its.psu.edu/
http://its.psu.edu/computing.html
http://its.psu.edu/learning.html
http://search.psu.edu/query.html
...is sorted by the current algorithm (in the OP) into the following desired order...
http://aset.psu.edu/ait/
http://aset.psu.edu/ait/filesys.html
http://aset.psu.edu/unix_group/lsfaqs.html
http://aset.psu.edu/unix_group/quickunix.html
http://cac.psu.edu/
http://cac.psu.edu/publish/htpasswd/alternate.html
http://aset.its.psu.edu/announcements/newsgroup_changes.html
http://aset.its.psu.edu/unix_group/
http://aset.its.psu.edu/unix_group/unixaccounts.html
http://clc.its.psu.edu/
http://clc.its.psu.edu/Labs/
http://clc.its.psu.edu/Labs/Mac/
http://clc.its.psu.edu/labs/Mac/software/all.aspx
http://clc.its.psu.edu/labs/Mac/software/default.aspx
http://css.its.psu.edu/internet/
http://css.its.psu.edu/internet/unix.html
http://css.its.psu.edu/news/alerts/
http://css.its.psu.edu/news/alerts/K4notice.html
http://its.psu.edu/
http://its.psu.edu/computing.html
http://its.psu.edu/learning.html
http://search.psu.edu/query.html
...sorting is done first on the second-level domain (plus TLD), then on the hostname if any, then on the remaining string if any. (I thought I had already written that in the OP; perhaps it was not clear...)
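To make the criteria concrete, here is a minimal sketch of the sortkey rule -- not the OP's code, just a hypothetical `sortkey` helper, assuming hosts of at least two dot-separated labels:

```perl
#!/usr/local/bin/perl -w
use strict;

# Hypothetical sketch of the sortkey described above: second-level
# domain + TLD first, then the remaining host labels deepest-first,
# then the rest of the URL.  Assumes a host of at least two labels.
sub sortkey
{
  my ($url) = @_;
  my ($host, $rest) = split '/' , substr($url , index($url, '://') + 3) , 2;
  my @labels = split '\.' , $host;
  return lc join('.' , @labels[-2, -1]                     # e.g. psu.edu
                     , reverse @labels[0 .. $#labels - 2]  # its, aset, ...
          )
         . '/' . (defined $rest ? $rest : '');
}

print sortkey('http://aset.its.psu.edu/unix_group/'), "\n";
# psu.edu.its.aset/unix_group/
```

Sorting those keys lexically groups everything under one registered domain together before descending into sub-hosts.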
Lest we forget the question: is there a less verbose way (than the one in the OP) to sort the URLs on the criteria just presented above?
(Long) Side note: FWIW, I converted the given Schwartzian Transform to a Guttman-Rosler Transform as an exercise, which turned out to be around 14-16% faster (benchmarked on Perl 5.8, merge/quick sorts, FreeBSD 4.7/i386) -- not much of a difference (to me, in this case, unless I am missing something).
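For anyone following along: the essence of the GRT is to pack the sortkey and the original value into a single string, sort with the plain comparator-less sort, and unpack afterwards. A minimal sketch on unrelated data (sorting words by length; the data and the fixed-width key format are just illustrative):

```perl
#!/usr/local/bin/perl -w
use strict;

my @words = qw(pear fig banana kiwi);

# GRT: prepend a fixed-width, lexically sortable key, sort the packed
# strings with the default sort (no comparison sub), then strip the key.
my @by_length =
  map  { substr $_ , 4 }                   # drop the 4-digit key
  sort                                     # plain lexical sort
  map  { sprintf('%04d', length) . $_ }    # zero-padded length as key
  @words;

print "@by_length\n";   # fig kiwi pear banana
```

(Ties fall back to lexical order of the value itself, hence kiwi before pear.) The speedup comes from avoiding a Perl-level comparison sub on every compare, which is consistent with the modest 14-16% seen above.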
URLs: 1505
min: 13, max: 1270
mean: 144.12, median: 62.00
Benchmark: running GRT, ST for at least 10 CPU seconds...
GRT: 10 wallclock seconds ( 8.77 usr + 1.29 sys = 10.06 CPU) @ 7.25/s (n=73)
ST: 10 wallclock seconds ( 8.91 usr + 1.20 sys = 10.10 CPU) @ 6.24/s (n=63)
         Rate    ST   GRT
ST     6.24/s    --  -14%
GRT    7.25/s   16%    --
Below is the code which benchmarks the GRT and the ST. The output of the two subs is printed to STDERR so that it can be redirected somewhere other than the terminal while benchmarking.
#!/usr/local/bin/perl -w
use strict;
#use sort '_quicksort';
#use sort '_mergesort';
use Benchmark qw(cmpthese);
use Netscape::History;
use Netscape::HistoryURL;

my $history = new Netscape::History("/home/parv/.netscape/history.dat");

my %hist;
while ( defined (my $url = $history->next_url()) )
{
  ( $hist{ $url } = $url->title() ) =~ s/\s+/ /g;
  $hist{ $url } = $hist{ $url } || "--";
}
$history->close();

# print some url statistics
printf "URLs: %u\n min: %u, max: %u\n mean: %0.2f, median: %0.2f\n\n"
  , @{ stats() };

cmpthese( -10 , { 'ST' => \&schwartz , 'GRT' => \&guttman_rosler } );

# Guttman-Rosler Transform
sub guttman_rosler
{
  print STDERR +($hist{$_} , "\n", $_ , "\n\n") foreach
    # extract the sorted complete URL list
    map { (split '\0' , $_ , 2)[1] }
    sort
    # set up for extracting the complete URL & sortkeys
    map
    {
      # extract host/domain components in reverse order
      my $scheme_end = index($_, '://') + 3;
      my @host =
        reverse split '\.' , (split '/' , substr($_ , $scheme_end) , 2)[0];

      # put domain at front; everything else afterwards
      # ----
      #           poo.koo  ->  poo.koo
      #    web.poo.koo:80  ->  poo.koo:80.web
      #  rand.web.poo.koo  ->  poo.koo.web.rand
      # ----
      lc( +(scalar @host > 1 ? $host[1] . '.' : '')
          . $host[0]
          . ( scalar @host < 3
              ? ''
              : '.' . join('.' , @host[2..$#host])
            )
          . substr($_ , length(join '.', @host) + $scheme_end)
        )
      . "\x00" . $_
    }
    keys %hist;

  return;
}

# Schwartzian Transform
sub schwartz
{
  print STDERR +($hist{$_} , "\n", $_ , "\n\n") foreach
    # extract the sorted complete URL list
    map { $_->[0] }
    # sort on massaged host name or the complete url
    sort { $a->[1] cmp $b->[1] }
    # set up for extracting the complete URL & sortkeys
    map
    {
      # extract host/domain components in reverse order
      my @host =
        reverse split '\.' ,
          ( split '/' , substr($_ , index($_, '://') + 3) , 2 )[0];

      [ # current URL
        $_
      ,
        # put domain at front; everything else afterwards
        # ----
        #           poo.koo  ->  poo.koo
        #    web.poo.koo:80  ->  poo.koo:80.web
        #  rand.web.poo.koo  ->  poo.koo.web.rand
        # ----
        lc +(scalar @host > 1 ? $host[1] . '.' : '')
           . $host[0]
           . +( scalar @host < 3
                ? ''
                : '.' . join('.' , @host[2..$#host])
              )
           . substr($_ , length(join '.', @host) + index($_, '://') + 3)
      ]
    }
    keys %hist;

  return;
}

# calculate some URL statistics
sub stats
{
  my $total = 0;
  my @len;
  foreach ( sort {length $a <=> length $b} keys %hist )
  {
    my $len = length $_;
    $total += $len;
    push @len, $len;
  }

  my $number = scalar keys %hist;
  my $mid_pt = int($number / 2);
  # median: middle length for an odd count,
  # average of the two middle lengths for an even count
  my $median =
    $number % 2
      ? $len[$mid_pt]
      : ($len[$mid_pt - 1] + $len[$mid_pt]) / 2;

  return [ $number , $len[0] , $len[-1] , $total/$number , $median ];
}