comment on

http://aset.its.psu.edu/announcements/newsgroup_changes.html
http://aset.its.psu.edu/unix_group/
http://aset.its.psu.edu/unix_group/unixaccounts.html
http://aset.psu.edu/ait/
http://aset.psu.edu/ait/filesys.html
http://aset.psu.edu/unix_group/lsfaqs.html
http://aset.psu.edu/unix_group/quickunix.html
http://cac.psu.edu/
http://cac.psu.edu/publish/htpasswd/alternate.html
http://clc.its.psu.edu/
http://clc.its.psu.edu/Labs/
http://clc.its.psu.edu/Labs/Mac/
http://clc.its.psu.edu/labs/Mac/software/all.aspx
http://clc.its.psu.edu/labs/Mac/software/default.aspx
http://css.its.psu.edu/internet/
http://css.its.psu.edu/internet/unix.html
http://css.its.psu.edu/news/alerts/
http://css.its.psu.edu/news/alerts/K4notice.html
http://its.psu.edu/
http://its.psu.edu/computing.html
http://its.psu.edu/learning.html
http://search.psu.edu/query.html

...is sorted by current algorithm (in OP) in the following desired order...

http://aset.psu.edu/ait/
http://aset.psu.edu/ait/filesys.html
http://aset.psu.edu/unix_group/lsfaqs.html
http://aset.psu.edu/unix_group/quickunix.html
http://cac.psu.edu/
http://cac.psu.edu/publish/htpasswd/alternate.html
http://aset.its.psu.edu/announcements/newsgroup_changes.html
http://aset.its.psu.edu/unix_group/
http://aset.its.psu.edu/unix_group/unixaccounts.html
http://clc.its.psu.edu/
http://clc.its.psu.edu/Labs/
http://clc.its.psu.edu/Labs/Mac/
http://clc.its.psu.edu/labs/Mac/software/all.aspx
http://clc.its.psu.edu/labs/Mac/software/default.aspx
http://css.its.psu.edu/internet/
http://css.its.psu.edu/internet/unix.html
http://css.its.psu.edu/news/alerts/
http://css.its.psu.edu/news/alerts/K4notice.html
http://its.psu.edu/
http://its.psu.edu/computing.html
http://its.psu.edu/learning.html
http://search.psu.edu/query.html

...sorting is done first on the 2d level TLD, then on hostname if any, then on the remaining string if any. (I thought i already wrote that in OP; perhaps was not clear...)

Lest we forget the question, is there a less verbose way (than the one in OP) to sort the URLs on criteria just presented above?

(Long) Side note: FWIW, i converted the given Schwartzian transform to Gottman-Rosler Transform as an exercise, which was faster around 14-16% (benchmarked, Perl 5.8, merge/quick sorts, FreeBSD 4.7/386) -- not much of a difference (to me in this case, unless i am missing something).

URLs: 1505
  min: 13,  max: 1270
  mean: 144.12, median: 62.00

Benchmark: running GRT, ST for at least 10 CPU seconds...
       GRT: 10 wallclock seconds ( 8.77 usr +  1.29 sys = 10.06 CPU) @  7.25/s (n=73)
        ST: 10 wallclock seconds ( 8.91 usr +  1.20 sys = 10.10 CPU) @  6.24/s (n=63)
      Rate   ST  GRT
ST  6.24/s   -- -14%
GRT 7.25/s  16%   --

Below is the code which benchmarks GRT and ST. The output of the two subs is printed to STDERR so that it could be redirected to somewhere other than the terminal for benchmark purposes.

#!/usr/local/bin/perl -w

use strict;

#use sort '_quicksort';
#use sort '_mergesort';

use Benchmark qw(cmpthese);

use Netscape::History;
use Netscape::HistoryURL;

my $history = new Netscape::History("/home/parv/.netscape/history.dat"
+);
my %hist;

while ( defined (my $url = $history->next_url()) )
{
  ( $hist{ $url } = $url->title() ) =~ s/\s+/ /g;
  $hist{ $url } = $hist{ $url } || "--";
}
$history->close();

#  print some url statistics
printf
  "URLs: %u\n  min: %u,  max: %u\n  mean: %0.2f, median: %0.2f\n\n"
  , @{ stats() };


cmpthese( -10 , { 'ST'=> \&schwartz , 'GRT'=> \&guttman_rosler });


#  Guttman-Rosler Transform
sub guttman_rosler
{
  print STDERR +($hist{$_} , "\n", $_ , "\n\n")
    foreach
      #  extract the sorted complete URL list
      map { (split '\0' , $_ , 2)[1] }
      sort
        #  set up for extracting the complete URL & sortkeys
        map { #  extract host/domain components in reverse order
              my $scheme_end = index($_, '://') + 3;

              my @host =
                reverse
                  split '\.' , (split '/' , substr($_ , $scheme_end) ,
+ 2)[0];

              #  put domain at front; everything else afterwords
              #  ----
              #  poo.koo -> poo.koo
              #  web.poo.koo:80 -> poo.koo:80.web
              #  rand.web.poo.koo -> poo.koo.web.rand
              #  ----
              lc( +(scalar @host > 1 ? $host[1] . '.' : '')
                  . $host[0]
                  . ( scalar @host < 3
                      ? '' : '.' . join('.' , @host[2..$#host])
                    )
                  . substr($_ , length(join '.', @host) + $scheme_end)
                )
              . "\x00" . $_
            } keys %hist;
  return;
}

#  Schwartzian Transform
sub schwartz
{
  print STDERR +($hist{$_} , "\n", $_ , "\n\n")
    foreach
      #  extract the sorted complete URL list
      map { $_->[0] }

      #  sort on massged host name or the complete url
      sort { $a->[1] cmp $b->[1] }

        #  set up for extracting the complete URL & sortkeys
        map { #  extract host/domain components in reverse order
              my @host =
                reverse split '\.'
                          , ( split '/' , substr($_ , index($_, '://')
+ + 3)
                                        , 2
                            )[0];

              [ #  current URL
                $_

              , #  put domain at front; everything else afterwords
                #  ----
                #  poo.koo -> poo.koo
                #  web.poo.koo:80 -> poo.koo:80.web
                #  rand.web.poo.koo -> poo.koo.web.rand
                #  ----
                lc +(scalar @host > 1 ? $host[1] . '.' : '')
                  . $host[0]
                  . +( scalar @host < 3
                        ? '' : '.' . join('.' , @host[2..$#host])
                      )
                  . substr($_ , length(join '.', @host) + index($_, ':
+//') + 3)
              ]
            } keys %hist;
  return;
}

#  calculate some URL statistics
sub stats
{
  my $total = 0;
  my @len;

  foreach ( sort {length $a <=> length $b} keys %hist )
  {
    my $len = length $_;

    $total += $len;

    push @len, $len;
  }

  my $number = scalar keys %hist;

  my $mid_pt = int($number / 2);
  my $median =
    $mid_pt % 2 == 0 ? $len[$mid_pt]
                     : ($len[$mid_pt] + $len[$mid_pt + 1]) / 2;

  return [  $number , $len[0] , $len[-1] , $total/$number , $median ];
}
[download]

In reply to Re: Re: Re: Re: Sorting URLs on domain/host: sortkeys generation by parv
in thread Sorting URLs on domain/host: sortkeys generation by parv

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.