in reply to Re: Re: Re: Sorting URLs on domain/host: sortkeys generation
in thread Sorting URLs on domain/host: sortkeys generation
http://aset.its.psu.edu/announcements/newsgroup_changes.html
http://aset.its.psu.edu/unix_group/
http://aset.its.psu.edu/unix_group/unixaccounts.html
http://aset.psu.edu/ait/
http://aset.psu.edu/ait/filesys.html
http://aset.psu.edu/unix_group/lsfaqs.html
http://aset.psu.edu/unix_group/quickunix.html
http://cac.psu.edu/
http://cac.psu.edu/publish/htpasswd/alternate.html
http://clc.its.psu.edu/
http://clc.its.psu.edu/Labs/
http://clc.its.psu.edu/Labs/Mac/
http://clc.its.psu.edu/labs/Mac/software/all.aspx
http://clc.its.psu.edu/labs/Mac/software/default.aspx
http://css.its.psu.edu/internet/
http://css.its.psu.edu/internet/unix.html
http://css.its.psu.edu/news/alerts/
http://css.its.psu.edu/news/alerts/K4notice.html
http://its.psu.edu/
http://its.psu.edu/computing.html
http://its.psu.edu/learning.html
http://search.psu.edu/query.html
...is sorted by the current algorithm (in the OP) into the following desired order...
http://aset.psu.edu/ait/
http://aset.psu.edu/ait/filesys.html
http://aset.psu.edu/unix_group/lsfaqs.html
http://aset.psu.edu/unix_group/quickunix.html
http://cac.psu.edu/
http://cac.psu.edu/publish/htpasswd/alternate.html
http://aset.its.psu.edu/announcements/newsgroup_changes.html
http://aset.its.psu.edu/unix_group/
http://aset.its.psu.edu/unix_group/unixaccounts.html
http://clc.its.psu.edu/
http://clc.its.psu.edu/Labs/
http://clc.its.psu.edu/Labs/Mac/
http://clc.its.psu.edu/labs/Mac/software/all.aspx
http://clc.its.psu.edu/labs/Mac/software/default.aspx
http://css.its.psu.edu/internet/
http://css.its.psu.edu/internet/unix.html
http://css.its.psu.edu/news/alerts/
http://css.its.psu.edu/news/alerts/K4notice.html
http://its.psu.edu/
http://its.psu.edu/computing.html
http://its.psu.edu/learning.html
http://search.psu.edu/query.html
...sorting is done first on the second-level domain (plus TLD), then on the hostname if any, then on the remaining string if any. (I thought I had already written that in the OP; perhaps it was not clear...)
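To make the criteria concrete, here is a minimal sketch of the sortkey rule -- not the OP's code, just a hypothetical `sortkey` helper, assuming hosts of at least two dot-separated labels:

```perl
#!/usr/local/bin/perl -w
use strict;

# Hypothetical sketch of the sortkey described above: second-level
# domain + TLD first, then the remaining host labels deepest-first,
# then the rest of the URL.  Assumes a host of at least two labels.
sub sortkey
{
  my ($url) = @_;
  my ($host, $rest) = split '/' , substr($url , index($url, '://') + 3) , 2;
  my @labels = split '\.' , $host;
  return lc join('.' , @labels[-2, -1]                     # e.g. psu.edu
                     , reverse @labels[0 .. $#labels - 2]  # its, aset, ...
          )
         . '/' . (defined $rest ? $rest : '');
}

print sortkey('http://aset.its.psu.edu/unix_group/'), "\n";
# psu.edu.its.aset/unix_group/
```

Sorting those keys lexically groups everything under one registered domain together before descending into sub-hosts.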
Lest we forget the question: is there a less verbose way (than the one in the OP) to sort the URLs on the criteria just presented above?
(Long) Side note: FWIW, I converted the given Schwartzian Transform to a Guttman-Rosler Transform as an exercise, which turned out to be around 14-16% faster (benchmarked on Perl 5.8, merge/quick sorts, FreeBSD 4.7/i386) -- not much of a difference (to me, in this case, unless I am missing something).
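For anyone following along: the essence of the GRT is to pack the sortkey and the original value into a single string, sort with the plain comparator-less sort, and unpack afterwards. A minimal sketch on unrelated data (sorting words by length; the data and the fixed-width key format are just illustrative):

```perl
#!/usr/local/bin/perl -w
use strict;

my @words = qw(pear fig banana kiwi);

# GRT: prepend a fixed-width, lexically sortable key, sort the packed
# strings with the default sort (no comparison sub), then strip the key.
my @by_length =
  map  { substr $_ , 4 }                   # drop the 4-digit key
  sort                                     # plain lexical sort
  map  { sprintf('%04d', length) . $_ }    # zero-padded length as key
  @words;

print "@by_length\n";   # fig kiwi pear banana
```

(Ties fall back to lexical order of the value itself, hence kiwi before pear.) The speedup comes from avoiding a Perl-level comparison sub on every compare, which is consistent with the modest 14-16% seen above.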
URLs: 1505
min: 13, max: 1270
mean: 144.12, median: 62.00
Benchmark: running GRT, ST for at least 10 CPU seconds...
GRT: 10 wallclock seconds ( 8.77 usr + 1.29 sys = 10.06 CPU) @ 7.25/s (n=73)
ST: 10 wallclock seconds ( 8.91 usr + 1.20 sys = 10.10 CPU) @ 6.24/s (n=63)
         Rate    ST   GRT
ST     6.24/s    --  -14%
GRT    7.25/s   16%    --
Below is the code which benchmarks the GRT and the ST. The output of the two subs is printed to STDERR so that it can be redirected somewhere other than the terminal while benchmarking.
#!/usr/local/bin/perl -w
use strict;
#use sort '_quicksort';
#use sort '_mergesort';
use Benchmark qw(cmpthese);
use Netscape::History;
use Netscape::HistoryURL;

my $history = new Netscape::History("/home/parv/.netscape/history.dat");

my %hist;
while ( defined (my $url = $history->next_url()) )
{
  ( $hist{ $url } = $url->title() ) =~ s/\s+/ /g;
  $hist{ $url } = $hist{ $url } || "--";
}
$history->close();

# print some url statistics
printf "URLs: %u\n min: %u, max: %u\n mean: %0.2f, median: %0.2f\n\n"
  , @{ stats() };

cmpthese( -10 , { 'ST' => \&schwartz , 'GRT' => \&guttman_rosler } );

# Guttman-Rosler Transform
sub guttman_rosler
{
  print STDERR +($hist{$_} , "\n", $_ , "\n\n") foreach
    # extract the sorted complete URL list
    map { (split '\0' , $_ , 2)[1] }
    sort
    # set up for extracting the complete URL & sortkeys
    map
    {
      # extract host/domain components in reverse order
      my $scheme_end = index($_, '://') + 3;
      my @host =
        reverse split '\.' , (split '/' , substr($_ , $scheme_end) , 2)[0];

      # put domain at front; everything else afterwards
      # ----
      #           poo.koo  ->  poo.koo
      #    web.poo.koo:80  ->  poo.koo:80.web
      #  rand.web.poo.koo  ->  poo.koo.web.rand
      # ----
      lc( +(scalar @host > 1 ? $host[1] . '.' : '')
          . $host[0]
          . ( scalar @host < 3
              ? ''
              : '.' . join('.' , @host[2..$#host])
            )
          . substr($_ , length(join '.', @host) + $scheme_end)
        )
      . "\x00" . $_
    }
    keys %hist;

  return;
}

# Schwartzian Transform
sub schwartz
{
  print STDERR +($hist{$_} , "\n", $_ , "\n\n") foreach
    # extract the sorted complete URL list
    map { $_->[0] }
    # sort on massaged host name or the complete url
    sort { $a->[1] cmp $b->[1] }
    # set up for extracting the complete URL & sortkeys
    map
    {
      # extract host/domain components in reverse order
      my @host =
        reverse split '\.' ,
          ( split '/' , substr($_ , index($_, '://') + 3) , 2 )[0];

      [ # current URL
        $_
      ,
        # put domain at front; everything else afterwards
        # ----
        #           poo.koo  ->  poo.koo
        #    web.poo.koo:80  ->  poo.koo:80.web
        #  rand.web.poo.koo  ->  poo.koo.web.rand
        # ----
        lc +(scalar @host > 1 ? $host[1] . '.' : '')
           . $host[0]
           . +( scalar @host < 3
                ? ''
                : '.' . join('.' , @host[2..$#host])
              )
           . substr($_ , length(join '.', @host) + index($_, '://') + 3)
      ]
    }
    keys %hist;

  return;
}

# calculate some URL statistics
sub stats
{
  my $total = 0;
  my @len;
  foreach ( sort {length $a <=> length $b} keys %hist )
  {
    my $len = length $_;
    $total += $len;
    push @len, $len;
  }

  my $number = scalar keys %hist;
  my $mid_pt = int($number / 2);
  # median: middle length for an odd count,
  # average of the two middle lengths for an even count
  my $median =
    $number % 2
      ? $len[$mid_pt]
      : ($len[$mid_pt - 1] + $len[$mid_pt]) / 2;

  return [ $number , $len[0] , $len[-1] , $total/$number , $median ];
}