in reply to Re: Sorting URLs on domain/host: sortkeys generation
in thread Sorting URLs on domain/host: sortkeys generation

After first reading about the two transforms and then actually implementing them, i realized that programming is a compromise among readability, speed, typing, and ease of implementation.

Some direct observations which led to the code presented later...

Below are the benchmark results as promised. In the results below, the difference between "GRT" and "GRT UK" is that the former doesn't use regexen; the differences between "GRT" and "ST" are inherent to the two transforms.

URL Length Statistics:
  Total URLs: 1496  Lengths' Sum: 216233
  Min: 13  Max: 1270
  Mean: 144.541  Median: 62.000
  Std. Deviation: 235.041  Variance: 55244.207

Benchmark: running GRT, GRT UK, ST for at least 5 CPU seconds...
       GRT:  5 wallclock seconds ( 4.61 usr +  0.39 sys =  5.00 CPU) @  9.00/s (n=45)
    GRT UK:  5 wallclock seconds ( 4.84 usr +  0.36 sys =  5.20 CPU) @  7.69/s (n=40)
        ST:  5 wallclock seconds ( 4.78 usr +  0.24 sys =  5.02 CPU) @  7.17/s (n=36)
         Rate     ST GRT UK    GRT
ST     7.17/s     --    -7%   -20%
GRT UK 7.69/s     7%     --   -15%
GRT    9.00/s    26%    17%     --

You will notice in the code that the key generation part is pretty much the same (in order to get the same output) among all three transforms. Code follows...

#!/usr/local/bin/perl -w
use strict;

#use sort '_quicksort';
#use sort '_mergesort';
#use sort 'stable';

use Benchmark qw(cmpthese);

# (global) url & page title hash
my %hist;

# fill %hist
fill();

# print some url statistics
length_stat( [ keys %hist ] );

#printf "\nCurrent sort: %s\n" , sort::current();

cmpthese( -5 ,
  { 'GRT'    => \&guttman_rosler
  , 'GRT UK' => \&guttman_rosler_uk
  , 'ST'     => \&schwartz
  }
);

# Guttman-Rosler Transform
sub guttman_rosler
{
  #print STDERR +($hist{$_} , "\n", $_ , "\n\n")
  print STDERR +( $_ , "\n\n")
    foreach
      map { substr( $_ , index( $_ , $; ) + 1 ) }
      sort
      map
      {
        # extract host/domain components in reverse order
        my $scheme_end = index($_, '://') + 3;
        my @host = reverse split '\.' ,
                     (split '/' , substr($_ , $scheme_end) , 2)[0];
        my $items = scalar @host;

        # put domain at front, then host parts if any, then everything else
        lc(
            # poo.koo -> poo_koo
            ($items > 1 ? $host[1] . '_' : '') . $host[0]
            # rand.web.poo.koo -> poo_koo_web_rand
          . ( $items < 3 ? '' : '_' . join('_' , @host[2 .. $items - 1]) )
            # http://rand.web.poo.koo -> poo_koo_web_rand_http://
          . '_' . substr($_ , 0 , $scheme_end)
            # http://rand.web.poo.koo/blah/blah
            # -> poo_koo_web_rand_http:///blah/blah
          . substr($_ , length(join '_', @host) + $scheme_end)
          )
        . $; . $_
      }
      keys %hist;

  return;
}

# Guttman-Rosler Transform -- uses regex instead of substr()/index()
sub guttman_rosler_uk
{
  #print STDERR +($hist{$_} , "\n", $_ , "\n\n")
  print STDERR +( $_ , "\n\n")
    foreach
      map { substr( $_ , index( $_ , $; ) + 1 ) }
      sort
      map
      {
        m!^ (?: (\w+):// )? ([^/]+) !x;
        my @host = reverse split '\.' , $2;
        my $items = scalar @host;

        # put domain at front; everything else afterwards
        lc(
            ($items > 1 ? $host[1] . '_' : '') . $host[0]
          . ( $items < 3 ? '' : '_' . join '_' , @host[2 .. $items - 1] )
          . '_' . ($1 || '')
          . substr($_ , index($_ , $2) + length $2)
          )
        . $; . $_ ;
      }
      keys %hist;

  return;
}

# Schwartzian Transform
sub schwartz
{
  #print STDERR +($hist{$_} , "\n", $_ , "\n\n")
  print STDERR +( $_ , "\n\n")
    foreach
      map { $_->[0] }
      sort { $a->[1] cmp $b->[1] || $a->[0] cmp $b->[0] }
      map
      {
        # extract host/domain components in reverse order
        my $scheme_end = index($_, '://') + 3;
        my @host = reverse split '\.' ,
                     (split '/' , substr($_ , $scheme_end) , 2)[0];
        my $items = scalar @host;

        [ $_ ,
          # put domain at front; everything else afterwards
          lc(
              ($items > 1 ? $host[1] . '_' : '') . $host[0]
            . ( $items < 3 ? '' : '_' . join('_' , @host[2 .. $items - 1]) )
            . '_' . substr($_ , 0 , $scheme_end)
            . substr($_ , length(join '_', @host) + $scheme_end)
            )
        ]
      }
      keys %hist;

  return;
}

# fill %hist
sub fill
{
  use Netscape::History;
  use Netscape::HistoryURL;

  my $history = new Netscape::History("/home/parv/.netscape/history.dat");
  while ( defined (my $url = $history->next_url()) )
  {
    ( $hist{ $url } = $url->title() ) =~ s/\s+/ /g;
    $hist{ $url } = $hist{ $url } || "--";
  }
  $history->close();
}

# calculate some statistics of elements' lengths
sub length_stat
{
  # store sorted elements
  my @in = sort { length $a <=> length $b } @{ $_[0] };

  my %stat;
  @stat{qw( length_sum count min max median mean var std_dev )}
    = (0 , scalar @in );

  # convert elements to their lengths; get sum of lengths
  $stat{'length_sum'} += ($_ = length) foreach @in;

  @stat{qw( min max mean )} =
    ( @in[0 , -1] , $stat{'length_sum'} / $stat{'count'} );

  # median (test the parity of the element count, not of the mid index)
  {
    my $mid = int($stat{'count'} / 2);
    $stat{'median'} =
      ($stat{'count'} % 2 != 0)
      ? $in[ $mid ]
      : ($in[ $mid ] + $in[ $mid - 1 ]) / 2;
  }

  # variance, thus std. deviation
  {
    my $sum_of_diffsq = 0;
    foreach (@in)
    {
      my $diff = $_ - $stat{'mean'};
      $sum_of_diffsq += ($diff * $diff);
    }
    $stat{'std_dev'} =
      sqrt( $stat{'var'} = $sum_of_diffsq / $stat{'count'} );
  }

  # print stats
  {
    my $fmt = "%0.3f";
    print << "_STATS_";

URL Length Statistics:
  Total URLs: $stat{'count'}  Lengths' Sum: $stat{'length_sum'}
  Min: $stat{'min'}  Max: $stat{'max'}
  Mean: @{[sprintf $fmt, $stat{'mean'}]}  Median: @{[sprintf $fmt, $stat{'median'}]}
  Std. Deviation: @{[sprintf $fmt, $stat{'std_dev'}]}  Variance: @{[sprintf $fmt, $stat{'var'}]}
_STATS_
  }

  return;
}

Re: Re: Re: Sorting URLs on domain/host: sortkeys generation
by BrowserUk (Patriarch) on Mar 31, 2003 at 06:37 UTC

    A couple of things. I didn't use the GRT in my original post for speed (for a change:), but rather for simplicity. In your version you use a regex to extract the domain name and protocol, but then revert to substr/index to get the path info. This can be done with the same regex. You can also greatly simplify the logic for your (slightly peculiar:) re-ordering of the parts of the domain name by using the power of slices to achieve your desired ordering.

    The simplification yields some performance benefit and matches that of your substr/index version of the GRT, but you can probably implement the same changes to that version and re-establish its lead.

    sub guttman_rosler_uk
    {
      my @host;
      print STDERR +( $_ , "\n\n")
        foreach
          map { substr( $_ , 1 + index( $_ , $; ) ) }
          sort
          map
          {
            m[(?:(^[^:]+)://)?([^/]+)(.*$)];
            @host = (reverse( split '\.' , $2 ), $3, $1);
            lc( join '_', @host[1, 0, 2 .. $#host] ) . $; . $_ ;
          }
          keys %hist;

      return;
    }

    However, a more critical observation is that your method of ordering the sub-components of the domain name could be considered broken.

    Using it, all .co.uk hosts will get sorted together, but these will be a long way away from .ac.uk and .gov.uk etc. Many non-US countries have categorisations within the country-specific TLD, e.g. com.au in Australia. In fact, I think almost every country except the US does this to some degree or another.

    This was the reason I suggested that you completely reverse the order of the domain name components. That way all .uk hosts are grouped, and within that all the .ac, .co, .gov etc. This is why the technique is used for Java .jar naming conventions. Tim Berners-Lee is also on record as saying that if there were one change he could make to his definition of the WWW, it would be to order the domain name components in reverse.
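    The fully-reversed ordering described above can be sketched as a small GRT of its own (my illustration, not code from the thread; the sub name and sample hosts are made up):

    ```perl
    use strict;
    use warnings;

    # Build each sort key by fully reversing the host components,
    # so the TLD leads: www.bbc.co.uk -> "uk_co_bbc_www".  All .uk
    # hosts then group together, and within .uk the .ac, .co, .gov
    # sub-categories group as well.
    sub by_reversed_host {
        return
            map  { substr( $_, 1 + index( $_, $; ) ) }   # strip the key
            sort                                         # plain string sort
            map  { lc( join '_', reverse split /\./, $_ ) . $; . $_ }
            @_;
    }

    print "$_\n" for by_reversed_host(
        qw( www.ibm.com www.bbc.co.uk www.ox.ac.uk www.ibm.com.au )
    );
    ```

    Note how www.ox.ac.uk and www.bbc.co.uk end up adjacent (both under uk_), whereas parv's second-level-first keys would have scattered them.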

    The only advantage that I can see with your scheme is that ibm.com will get sorted close to ibm.jp and ibm.de, but this will still break down when you get to ibm.co.uk or ibm.com.au.

    This, I think, is why IBM (and others) have fairly recently moved away from using the country-specific domain names (except to redirect to their main TLD) in favour of ibm.com/uk and ibm.com/jp etc.

    Of course, you know best as to your particular needs, but I thought that I would mention it anyway.


    Examine what is said, not who speaks.
    1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
    2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
    3) Any sufficiently advanced technology is indistinguishable from magic.
    Arthur C. Clarke.
      (:

      I never meant to imply that you posted the GRT version for speed. You see, before starting this thread, i did a search on sorting URLs by hostname. That led to the GRT, the GRT/ST comparisons, the sort paper, and eventually the benchmark results i posted. I benchmarked your version too, just to see how it would compare to my two previous solutions. Just plain curiosity.

      About the order of host name components... i wanted to keep URLs sorted based on the 2nd part of the domain name in (com|net|org|edu), thus the "slightly peculiar" order. But then, i hadn't thought about country codes. Needless to say, that shows.

      So far i have come upon only one or two such URLs in my netscape history. If i need to keep the desired order, i have to work on the sorting of country codes too... back to the drawing board...
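      One possible direction from that drawing board (entirely my sketch, not from the thread; the sub name is made up and the category list is deliberately incomplete): keep the second-level name first for generic TLDs, but fold "category.countrycode" pairs like co.uk or com.au into a single effective TLD, so ibm.com, ibm.de and ibm.co.uk all group under "ibm":

      ```perl
      use strict;
      use warnings;

      my %generic  = map { $_ => 1 } qw( com net org edu );
      my %category = map { $_ => 1 } qw( ac co com edu gov net org );

      # Produce a sort key with the organisation's name first,
      # treating "co.uk"-style pairs as one effective TLD.
      sub host_key {
          my @part = reverse split /\./, lc $_[0];   # TLD first
          # fold a "category.countrycode" pair into one label
          if ( @part > 1 and length $part[0] == 2
               and !$generic{ $part[0] } and $category{ $part[1] } ) {
              splice @part, 0, 2, "$part[1].$part[0]";
          }
          # lead with the organisation's name, then the (effective) TLD
          @part[0, 1] = @part[1, 0] if @part > 1;
          return join '_', @part;
      }
      ```

      With this, host_key gives "ibm_com_www" for www.ibm.com, "ibm_de" for ibm.de, and "ibm_co.uk_www" for www.ibm.co.uk, so a plain string sort keeps them adjacent. A real version would need a proper table of per-country second-level categories.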

      (Thanks by the way.)

Re: Re: Re: Sorting URLs on domain/host: sortkeys generation
by parv (Parson) on Mar 31, 2003 at 04:31 UTC

    Below is the result of a benchmark in which none of the three transform subs prints (to STDERR); instead each just returns a reference to the whole sorted list (sub transform { return [ map {} sort {} map {} (list) ]; }).

    URL Length Statistics:
      Total URLs: 1497  Lengths' Sum: 216277
      Min: 13  Max: 1270
      Mean: 144.474  Median: 62.000
      Std. Deviation: 234.977  Variance: 55214.052
    
    Benchmark: running GRT, GRT UK, ST for at least 5 CPU seconds...
           GRT:  5 wallclock secs ( 5.05 usr +  0.01 sys =  5.06 CPU) @ 10.07/s (n=51)
        GRT UK:  5 wallclock secs ( 5.08 usr +  0.00 sys =  5.08 CPU) @  8.47/s (n=43)
            ST:  5 wallclock secs ( 5.27 usr +  0.00 sys =  5.27 CPU) @  7.79/s (n=41)
             Rate     ST GRT UK    GRT
    ST     7.79/s     --    -8%   -23%
    GRT UK 8.47/s     9%     --   -16%
    GRT    10.1/s    29%    19%     --
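    The shape of that change, reduced to a toy (my reconstruction of the idea, not the actual benchmarked subs; the sub name and sample URLs are made up), is simply to return the sorted list reference so Benchmark measures the transform rather than STDERR I/O:

    ```perl
    use strict;
    use warnings;

    # GRT that returns its result instead of printing it, which is
    # what the fairer benchmark above times.
    sub transform {
        return [
            map  { substr( $_, 1 + index( $_, $; ) ) }   # strip the key
            sort
            map  { lc( join '_', reverse split /\./, $_ ) . $; . $_ }
            @_
        ];
    }

    my $sorted = transform('b.example.com', 'a.example.com');
    print "$_\n" for @$sorted;
    ```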