A couple of things. I used the GRT in my original post not for speed (for a change :), but for simplicity. In your version you use a regex to extract the domain name and protocol, but then revert to substr/index to get the path info. The same regex can capture all three. You can also greatly simplify the logic for your (slightly peculiar :) re-ordering of the parts of the domain name by using the power of slices to achieve your desired ordering.
The simplification yields some performance benefit and brings this version level with your substr/index version of the GRT, but you can probably apply the same changes to that version and re-establish its lead.
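sub guttman_rosler_uk {
    my @host;
    print STDERR +( $_ , "\n\n") foreach
        # Strip the sort key: keep everything after the $; separator.
        map { substr( $_ , 1+index($_ , $;) ) }
        sort
        map {
            # One regex captures protocol ($1), host ($2) and path ($3).
            m[(?:(^[^:]+)://)?([^/]+)(.*$)];
            @host = (reverse( split '\.' , $2), $3, $1);
            # The slice swaps the first two reversed host parts.
            lc( join'_', @host[1,0, 2..$#host] )
                . $; . $_ ;
        } keys %hist;
    return;
}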
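To make the two ideas easier to see in isolation, here is a minimal sketch of the key-building step on its own. The helper name swapped_key and the sample URL are made up for illustration; the regex and the slice are the ones used in the sub above.

```perl
use strict;
use warnings;

# Hypothetical helper: build the sort key for a single URL.
sub swapped_key {
    my ($url) = @_;

    # One match captures the optional protocol, the host and the path.
    my ( $proto, $host, $path ) = $url =~ m[(?:(^[^:]+)://)?([^/]+)(.*$)];
    $proto = '' unless defined $proto;    # no protocol given

    # Reversed host labels first, then the path, then the protocol.
    my @parts = ( reverse( split /\./, $host ), $path, $proto );

    # The slice swaps the first two elements and keeps the rest in order.
    return lc join '_', @parts[ 1, 0, 2 .. $#parts ];
}

# Illustrative URL only.
print swapped_key('http://www.example.co.uk/some/path'), "\n";
# prints: co_uk_example_www_/some/path_http
```

The slice does all of the re-ordering, so no temporaries or conditionals are needed to move the parts around.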
However, a more critical observation is that your method of ordering the sub-components of the domain name could be considered broken.
Using it, all .co.uk will get sorted together, but these will be a long way away from .ac.uk and .gov.uk etc. Many non-US countries have categorisations within the country-specific TLD, e.g. com.au in Australia. In fact, I think almost every country except the US does this to some degree or another.
This was the reason I suggested that you completely reverse the order of the domain name components. That way all .uk are grouped, and within that all the .ac, .co, .gov etc. This is why the same technique is used for Java package naming conventions. Tim Berners-Lee is also on record as saying that if there was one change he could make to his definition for the WWW, it would be to order the domain name components in the reverse order.
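As a rough sketch of what full reversal gives you, the following runs a GRT over a handful of made-up URLs. The helper name reversed_key and the sample hosts are assumptions for illustration only, not taken from your data.

```perl
use strict;
use warnings;

# Hypothetical helper: sort key with the host labels fully reversed.
sub reversed_key {
    my ($url) = @_;
    my ( $proto, $host, $path ) = $url =~ m[(?:(^[^:]+)://)?([^/]+)(.*$)];
    $proto = '' unless defined $proto;
    return lc join '_', reverse( split /\./, $host ), $path, $proto;
}

# Made-up sample URLs.
my @urls = qw(
    http://www.example.co.uk/
    http://www.example.com/
    http://www.example.ac.uk/
    http://www.example.com.au/
    http://www.example.gov.uk/
);

# GRT: prefix each URL with its key, do a plain sort, strip the key again.
my @sorted =
    map  { substr( $_, 1 + index( $_, $; ) ) }
    sort
    map  { reversed_key($_) . $; . $_ }
    @urls;

print "$_\n" for @sorted;
# prints:
# http://www.example.com.au/
# http://www.example.com/
# http://www.example.ac.uk/
# http://www.example.co.uk/
# http://www.example.gov.uk/
```

All three .uk hosts end up adjacent, ordered by their second-level category.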
The only advantage that I can see with your scheme is that ibm.com will get sorted close to ibm.jp and ibm.de, but this will still screw up when you get to ibm.co.uk or ibm.com.au.
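A small sketch of that failure mode, using bare host names for brevity. The helper name swapped_host_key is hypothetical, and example.com is thrown in only to show what can land between the IBM hosts.

```perl
use strict;
use warnings;

# Hypothetical helper: reverse the host labels, then swap the first two,
# which is the ordering the slice @host[1,0, 2..$#host] produces.
sub swapped_host_key {
    my ($host) = @_;
    my @h = reverse split /\./, $host;
    return lc join '_', @h[ 1, 0, 2 .. $#h ];
}

# Illustrative hosts only.
my @hosts = qw( ibm.com ibm.jp ibm.de ibm.co.uk ibm.com.au example.com );

# GRT again: key, separator, original; plain sort; strip the key.
my @sorted =
    map  { substr( $_, 1 + index( $_, $; ) ) }
    sort
    map  { swapped_host_key($_) . $; . $_ }
    @hosts;

print "$_\n" for @sorted;
# prints:
# ibm.co.uk
# ibm.com.au
# example.com
# ibm.com
# ibm.de
# ibm.jp
```

ibm.com, ibm.de and ibm.jp stay together, but ibm.co.uk and ibm.com.au sort under "co" and "com" instead, with unrelated hosts in between.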
This, I think, is why IBM (and others) have fairly recently moved away from using the country-specific domain names (except to redirect to their main TLD) in favour of ibm.com/uk and ibm.com/jp etc.
Of course, you know best as to your particular needs, but I thought that I would mention it anyway.
Examine what is said, not who speaks.
1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
3) Any sufficiently advanced technology is indistinguishable from magic.
Arthur C. Clarke.