Heuristic for parsing Host name and domain

persistence911 has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

Has anyone found a good heuristic for correctly extracting the domain name, including any subdomain, from a host name whose top-level domain may consist of two parts?

Take these examples:
www.drane.fresel.co.uk
www.drane2.ws.ru
www.drane3.Intern.com

In example 1, the name is drane.fresel and the TLD is co.uk.

In example 2, the name is drane2 and the TLD is ws.ru.

In example 3, the name is drane3.Intern and the TLD is com.

I would really appreciate some wisdom on how to approach a problem of this nature. The URI module on CPAN does not help in this case.

Replies are listed 'Best First'.
Re: Heuristic for parsing Host name and domain by Anonymous Monk on Aug 29, 2011 at 15:20 UTC
Use the Public Suffix List, for example via the Domain::PublicSuffix module on CPAN.
Re^2: Heuristic for parsing Host name and domain by ikegami (Patriarch) on Aug 29, 2011 at 18:49 UTC
This is definitely the way to go. ++
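For readers who want to see what a suffix-list lookup amounts to, here is a minimal sketch. It assumes a tiny hand-written table (`%SUFFIX`, hypothetical, and containing `ws.ru` only because the question treats it as a suffix); the real Public Suffix List is far larger and has wildcard and exception rules that this ignores. `split_host` is an illustrative name, not part of Domain::PublicSuffix.

```perl
use strict;
use warnings;

# Hypothetical, hand-picked subset; a real program would load the full
# Public Suffix List instead.
my %SUFFIX = map { $_ => 1 } qw(com uk co.uk ru ws.ru);

# Return (name, suffix) for a host, using the longest suffix found in %SUFFIX.
# Returns an empty list when no listed suffix matches.
sub split_host {
    my ($host) = @_;
    my @labels = split /\./, lc $host;

    # Try candidate suffixes from longest to shortest, always leaving
    # at least one label for the name.
    for my $i ( 1 .. $#labels ) {
        my $candidate = join '.', @labels[ $i .. $#labels ];
        if ( $SUFFIX{$candidate} ) {
            return ( join( '.', @labels[ 0 .. $i - 1 ] ), $candidate );
        }
    }
    return;
}

for my $host (qw(www.drane.fresel.co.uk www.drane2.ws.ru www.drane3.Intern.com)) {
    my ( $name, $suffix ) = split_host($host);
    print "$host: name '$name', suffix '$suffix'\n";
}
```

Note that the sketch lowercases the host and leaves the leading `www` in the name; stripping it would be a separate step.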
Re: Heuristic for parsing Host name and domain by flexvault (Monsignor) on Aug 29, 2011 at 15:22 UTC
Based upon your requirements I would do something like this. You might also add a hash of all valid TLD combinations, though that list may be exactly what you are trying to obtain.

```
#!/usr/bin/perl -w
use strict;

while ( my $url = <DATA> ) {
    chomp($url);
    my @Url = split( /\./, $url );
    if ( length( $Url[$#Url] ) == 3 ) {
        # A three-letter last label is taken to be the whole TLD
        print "$url:\t\tTLD is '" . $Url[$#Url] . "'\n";
    }
    else {
        # Otherwise assume a two-part TLD such as co.uk
        print "$url:\t\tTLD is '" . $Url[ $#Url - 1 ] . "." . $Url[$#Url] . "'\n";
    }
}
1;
__DATA__
www.drane.fresel.co.uk
www.drane2.ws.uk
www.drane3.Intern.com
```

result:

```
www.drane.fresel.co.uk: TLD is 'co.uk'
www.drane2.ws.uk: TLD is 'ws.uk'
www.drane3.Intern.com: TLD is 'com'
```

I use this in a spam detection program. It gets a lot more complicated than this!

Good luck.

"Well done is better than well said." - Benjamin Franklin
Re^2: Heuristic for parsing Host name and domain by Corion (Patriarch) on Aug 29, 2011 at 15:43 UTC
Your approach seems to fail for `.de` in one case and `.name`, `.mobi`, `.museum` in the other case. The only approach is to use a list of known TLDs.
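To make the failure concrete, here is a small sketch that wraps the length-based rule from the parent post in a sub (`guess_tld`, a name chosen here for illustration) and runs it on a few made-up hosts. It only shows what that rule returns; it is not a fix.

```perl
use strict;
use warnings;

# The length-based rule from the parent post, wrapped in a sub:
# a 3-letter last label is taken as the TLD, anything else takes two labels.
sub guess_tld {
    my ($host) = @_;
    my @parts = split /\./, $host;
    return $parts[-1] if length( $parts[-1] ) == 3;
    return join '.', @parts[ -2, -1 ];
}

# www.example.de and www.example.museum are made-up hosts for illustration.
for my $host (qw(www.example.de www.example.museum www.example.com)) {
    printf "%-22s guessed TLD: %s\n", $host, guess_tld($host);
}
```

For the first two hosts the rule reports `example.de` and `example.museum`, swallowing the registered name into the TLD, while `com` comes out right.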
Re^3: Heuristic for parsing Host name and domain by flexvault (Monsignor) on Aug 29, 2011 at 18:16 UTC
I wasn't trying to be exhaustive. I was showing an approach for what he needed. In the context of spam checking, it is the 3- and 5-letter TLDs that are the greatest problems. But you are correct!

"Well done is better than well said." - Benjamin Franklin
Re: Heuristic for parsing Host name and domain by DanielSpaniel (Scribe) on Aug 29, 2011 at 15:18 UTC
I'm no Perl monk, and I'm sure there may well be better ways of doing it, but the code below might help. I expect there is some module out there with valid TLDs, but, regardless, if you happen to know each of the TLDs you'll be dealing with, you could do something like this:

```
push( my @domains, 'www.drane.fresel.co.uk', 'www.drane2.ws.ru', 'www.drane3.Intern.com' );

for (@domains) {
    my $name = '';
    my $tld  = '';
    # Check each known TLD in turn; the name is everything before the dot that precedes it
    if ( $_ =~ /\.co\.uk$/ ) {
        $name = substr( $_, 0, rindex( $_, 'co.uk' ) - 1 );
        $tld  = 'co.uk';
    }
    elsif ( $_ =~ /\.ws\.ru$/ ) {
        $name = substr( $_, 0, rindex( $_, 'ws.ru' ) - 1 );
        $tld  = 'ws.ru';
    }
    elsif ( $_ =~ /\.com$/ ) {
        $name = substr( $_, 0, rindex( $_, 'com' ) - 1 );
        $tld  = 'com';
    }
    print "\$name:\t$name\n\$tld:\t$tld\n\n";
}
```

... where you just read all your domain names into an array, and then simply process the list.
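If the list of known TLDs grows, the if/elsif chain can be replaced by a single pattern built from an array. The following is only a sketch of that idea; `@suffixes`, `$suffix_re` and `name_and_tld` are names made up for this example, and the suffixes are just the ones from the question.

```perl
use strict;
use warnings;

# Suffixes taken from the examples in this thread; extend as needed.
my @suffixes = qw(co.uk ws.ru com);

# Escape each suffix and join them into one alternation.
my $alternation = join '|', map { quotemeta($_) } @suffixes;

# The non-greedy (.+?) keeps the name as short as possible, so the
# longest listed suffix that reaches the end of the string wins.
my $suffix_re = qr/^(.+?)\.($alternation)$/i;

# Return (name, tld), or an empty list if no listed suffix matches.
sub name_and_tld {
    my ($host) = @_;
    return $host =~ $suffix_re ? ( $1, $2 ) : ();
}

for my $host (qw(www.drane.fresel.co.uk www.drane2.ws.ru www.drane3.Intern.com)) {
    my ( $name, $tld ) = name_and_tld($host);
    print "\$name:\t$name\n\$tld:\t$tld\n\n";
}
```

Adding a TLD then means adding one entry to the array rather than another branch.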
Re: Heuristic for parsing Host name and domain by locked_user sundialsvc4 (Abbot) on Aug 31, 2011 at 01:12 UTC
Also be mindful of the (very large...) set of routines in CPAN under the category Regexp::Common. In general, “if you are looking for a regular expression that you suspect probably has been written before, it probably has.” And if you are looking for something that might need more than just a regular expression, i.e. programmed logic of some kind ... once again, CPAN probably already has it. Therefore, get into this habit (and I sorely wish I could take my own advice): look before you code.