Re: Heuristic for parsing Host name and domain

Based on your requirements, I would do something like this. You might also add a hash of all valid TLD combinations, though that list may be exactly what you are trying to obtain.

#!/usr/bin/perl
use strict;
use warnings;

# Heuristic: a 3-letter last label is taken as the whole TLD;
# otherwise the last two labels are reported together (e.g. 'co.uk').
while ( my $url = <DATA> ) {
    chomp($url);
    my @Url = split( /\./, $url );
    if ( length( $Url[$#Url] ) == 3 ) {
        print "$url:\t\tTLD is '" . $Url[$#Url] . "'\n";
    }
    else {
        print "$url:\t\tTLD is '" . $Url[ $#Url - 1 ] . "." . $Url[$#Url] . "'\n";
    }
}

__DATA__
www.drane.fresel.co.uk
www.drane2.ws.uk
www.drane3.Intern.com

result:

www.drane.fresel.co.uk:         TLD is 'co.uk'
www.drane2.ws.uk:               TLD is 'ws.uk'
www.drane3.Intern.com:          TLD is 'com'

I use this in a spam detection program. It gets a lot more complicated than this!

Good Luck.

"Well done is better than well said." - Benjamin Franklin


Re^2: Heuristic for parsing Host name and domain by Corion (Patriarch) on Aug 29, 2011 at 15:43 UTC
Your approach seems to fail for `.de` in one case and `.name`, `.mobi`, `.museum` in the other case. The only approach is to use a list of known TLDs.
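A minimal sketch of the list-based approach might look like the following. The `%known_suffix` hash and the `find_suffix` helper are hypothetical names invented for illustration, and the handful of entries is only a sample; a real program would load a complete, maintained list of suffixes instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical, deliberately tiny sample of known suffixes.
# A real list would be far longer and loaded from a maintained source.
my %known_suffix = map { $_ => 1 } qw(
    com org net de uk name mobi museum co.uk org.uk
);

# Return the longest known suffix of $host, or undef if none matches.
sub find_suffix {
    my ($host) = @_;
    my @labels = split /\./, lc $host;

    # Start at index 1 so the full host name is never treated as a suffix,
    # and try the longest candidate first so 'co.uk' wins over 'uk'.
    for my $i ( 1 .. $#labels ) {
        my $candidate = join '.', @labels[ $i .. $#labels ];
        return $candidate if $known_suffix{$candidate};
    }
    return;
}

for my $host (qw(www.drane.fresel.co.uk www.example.de www.example.museum)) {
    my $suffix = find_suffix($host);
    printf "%s:\tsuffix is '%s'\n", $host,
        defined $suffix ? $suffix : '(unknown)';
}
```

Because the lookup depends only on the list, label length no longer matters: `de`, `name`, `mobi` and `museum` are handled the same way as `com`, and a suffix missing from the list is reported as unknown rather than guessed.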
Re^3: Heuristic for parsing Host name and domain by flexvault (Monsignor) on Aug 29, 2011 at 18:16 UTC
I wasn't trying to be exhaustive. I was showing an approach for what he needed. In the context of spam checking, it is the 3 and 5 letter TLDs that are the greatest problems. But you are correct!