downer has asked for the wisdom of the Perl Monks concerning the following question:

I can think of some regular-expression heuristics for finding the top level domain of a URL. However, I am sure there are exceptions to my rules, and cases I have never seen, that would render my technique invalid. I was thinking of something like the following:

my $tld;
my $url = 'http://www.someurl.com/index.html';
if ($url =~ /(http:\/\/)?(.+?)\//) {
    my $host = $2;
    my @host_parts = split(/\./, $host);
    my $len = @host_parts;
    if (length($host_parts[-1]) <= 2 && $len > 2) {
        $tld = join('.', @host_parts[-2..-1]);
    }
    else {
        $tld = $host_parts[-1];
    }
}

Can any monks think of how to improve this? Is there a module for this that I didn't find?

Replies are listed 'Best First'.
Re: Finding the Top Level Domain from a URL
by ikegami (Patriarch) on Jun 17, 2009 at 22:10 UTC

    Your code seemingly intentionally finds something other than the TLD in some situations.

    • http://www.ibm.com/ gives com (ok)
    • http://www.ibm.ca/ gives ibm.ca (not tld)
    • http://www.ibm.co.uk/ gives co.uk (not tld)
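
The three results above can be reproduced with a short script. This is only a sketch: it wraps the heuristic from the question in a sub, and the name heuristic_tld is made up here for illustration.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the heuristic from the question.
# The regex and the length test are copied from the original post.
sub heuristic_tld {
    my ($url) = @_;
    return unless $url =~ /(http:\/\/)?(.+?)\//;
    my $host       = $2;
    my @host_parts = split /\./, $host;

    # A last label of two characters or fewer makes the heuristic
    # keep the final two labels instead of one.
    if (length($host_parts[-1]) <= 2 && @host_parts > 2) {
        return join '.', @host_parts[-2 .. -1];
    }
    return $host_parts[-1];
}

for my $url (qw( http://www.ibm.com/ http://www.ibm.ca/ http://www.ibm.co.uk/ )) {
    printf "%-25s %s\n", $url, heuristic_tld($url);
}
```

For these inputs it prints com, ibm.ca and co.uk, so only the first is an actual top level domain.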

    It also mishandles a number of the valid urls listed below. Fix:

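    use URI;

    $url = URI->new($url);
    defined( my $host = $url->host() )
        or die("No host\n");

    my $tld;
    if ($host =~ /\./) {
        # Capture the last label; tolerate one trailing dot (example.com.)
        ($tld) = $host =~ /\.([^.]+)\.?$/;
        defined($tld) && $tld =~ /[a-z]/i
            or die("No domain\n");
    }
    else {
        # A host name without dots is treated as local
        $host =~ /[a-z]/i
            or die("No domain\n");
        $tld = 'localdomain';
    }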

    It handles these valid URLs:

    • http://example.com/ (com)
    • http://example.com./ (com)
    • http://example.com (com)
    • http://example.com:80/ (com)
    • http://example/ (localdomain)
    • http://www.ibm.com/ (com)
    • http://www.ibm.ca/ (ca)
    • http://www.ibm.co.uk/ (uk)
    • http://www.ibm.com.au/ (au)
    • http://192.168.0.1/ (error)
    • http://3232235521/ (error)
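
The host checks behind this list can be tried without installing anything. The sketch below takes host strings typed in by hand, whereas the reply gets them from URI's host method; the sub name tld_of_host is made up for illustration, and the pattern allows one trailing dot so that example.com. yields com as listed.

```perl
use strict;
use warnings;

# Hypothetical helper: the host checks from the reply, applied to a
# plain host string instead of the result of URI->host.
sub tld_of_host {
    my ($host) = @_;
    if ($host =~ /\./) {
        # Last label, allowing one trailing dot (fully qualified form).
        my ($tld) = $host =~ /\.([^.]+)\.?$/;
        die "No domain\n" unless defined $tld && $tld =~ /[a-z]/i;
        return $tld;
    }

    # No dot at all: accept it as a local name only if it has a letter.
    die "No domain\n" unless $host =~ /[a-z]/i;
    return 'localdomain';
}

for my $host (qw( example.com example.com. example www.ibm.co.uk 192.168.0.1 3232235521 )) {
    my $tld = eval { tld_of_host($host) };
    print "$host: ", ( defined $tld ? $tld : 'error' ), "\n";
}
```

Requiring a letter in the last label is what turns both numeric forms of an IP address into errors rather than bogus domains.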

    Invalid urls aren't necessarily detected.

    Update: Updated the code to detect an invalid URL it didn't detect before.

Re: Finding the Top Level Domain from a URL
by Anonymous Monk on Jun 18, 2009 at 02:49 UTC

    #!/usr/bin/perl --
    use strict;
    use warnings;
    use URI;

    my $uri  = URI->new('http://www.someurl.com/index.html');
    my @host = split /\./, $uri->host;
    my $tld  = $host[-1];
    print "$tld\n";
    __END__

    URI, Net::Domain::TLD, Regexp::Common::URI