persistence911 has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

Has anyone found a good heuristic for correctly extracting the domain name (with its subdomains) from a host name whose top-level domain may consist of two parts?

Take these examples: www.drane.fresel.co.uk, www.drane2.ws.ru and www.drane3.Intern.com

Example 1 should be split into drane.fresel, with TLD co.uk

Example 2 into drane2, with TLD ws.ru

Example 3 into drane3.Intern, with TLD com

I would really love some wisdom on how to approach a problem of this nature. The URI module on CPAN does not help in this case.

Replies are listed 'Best First'.
Re: Heuristic for parsing Host name and domain
by Anonymous Monk on Aug 29, 2011 at 15:20 UTC
      This is definitely the way to go. ++
Re: Heuristic for parsing Host name and domain
by flexvault (Monsignor) on Aug 29, 2011 at 15:22 UTC

    Based upon your requirements, I would do something like this. You might also add a hash of all valid TLD combinations, though that list may be exactly what you are trying to obtain.

    #!/usr/bin/perl -w
    use strict;

    while ( my $url = <DATA> ) {
        chomp($url);
        my @Url = split( /\./, $url );
        if ( length( $Url[$#Url] ) == 3 ) {
            # A three-letter last label is treated as the whole TLD
            print "$url:\t\tTLD is '" . $Url[$#Url] . "'\n";
        }
        else {
            # Otherwise assume a two-part TLD such as co.uk
            print "$url:\t\tTLD is '" . $Url[ $#Url - 1 ] . "." . $Url[$#Url] . "'\n";
        }
    }
    1;
    __DATA__
    www.drane.fresel.co.uk
    www.drane2.ws.uk
    www.drane3.Intern.com

    result:

    www.drane.fresel.co.uk: TLD is 'co.uk'
    www.drane2.ws.uk: TLD is 'ws.uk'
    www.drane3.Intern.com: TLD is 'com'

    I use this in a spam detection program. It gets a lot more complicated than this!

    Good Luck.

    "Well done is better than well said." - Benjamin Franklin

      Your approach seems to fail for .de in one case and .name, .mobi, .museum in the other case. The only approach is to use a list of known TLDs.
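A minimal sketch of the list-based approach, for illustration only. The `%KNOWN_SUFFIX` hash and the `split_host` function are hypothetical names, and the suffix list is a tiny hand-picked subset (including `ws.ru`, which the original question treats as a TLD); a real program would load a complete, maintained list instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative subset only (assumption): a real program would load a full,
# maintained list of suffixes rather than hard-coding a handful.
my %KNOWN_SUFFIX = map { $_ => 1 } qw(com de name mobi museum uk co.uk ru ws.ru);

# Returns (name, suffix) for a host name, or an empty list if no listed
# suffix matches. The suffix is compared case-insensitively and returned
# in lower case; the name keeps its original case.
sub split_host {
    my ($host) = @_;
    my @labels = split /\./, $host;

    # Drop labels from the left one at a time, so the longest candidate
    # suffix is tested first ('co.uk' before 'uk').
    for my $i ( 1 .. $#labels ) {
        my $suffix = lc join '.', @labels[ $i .. $#labels ];
        next unless $KNOWN_SUFFIX{$suffix};
        return ( join( '.', @labels[ 0 .. $i - 1 ] ), $suffix );
    }
    return;
}

for my $host (qw(www.drane.fresel.co.uk www.drane2.ws.ru www.drane3.Intern.com www.example.de)) {
    my ( $name, $suffix ) = split_host($host);
    print defined $suffix
        ? "$host: name is '$name', TLD is '$suffix'\n"
        : "$host: no known TLD\n";
}
```

Because the lookup is driven by the list rather than by label length, two-letter TLDs such as .de and longer ones such as .museum are handled the same way as .com.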

        I wasn't trying to be exhaustive. I was showing an approach for what he needed.

        In the context of spam checking, it is the 3 and 5 letter TLDs that are the greatest problems.

        But you are correct!

        "Well done is better than well said." - Benjamin Franklin

Re: Heuristic for parsing Host name and domain
by DanielSpaniel (Scribe) on Aug 29, 2011 at 15:18 UTC

    I'm no Perl monk, and I'm sure there may well be better ways of doing it, but the code below might help. I expect there is some module out there with valid TLDs, but, regardless, if you happen to know each of the TLDs you'll be dealing with then you could do something like the code below maybe:

    my @domains = ( 'www.drane.fresel.co.uk', 'www.drane2.ws.ru', 'www.drane3.Intern.com' );

    for (@domains) {
        my $name = '';
        my $tld  = '';
        if ( $_ =~ /\.co\.uk$/ ) {
            $name = substr( $_, 0, rindex( $_, 'co.uk' ) - 1 );
            $tld  = 'co.uk';
        }
        elsif ( $_ =~ /\.ws\.ru$/ ) {
            $name = substr( $_, 0, rindex( $_, 'ws.ru' ) - 1 );
            $tld  = 'ws.ru';
        }
        elsif ( $_ =~ /\.com$/ ) {
            $name = substr( $_, 0, rindex( $_, 'com' ) - 1 );
            $tld  = 'com';
        }
        print "\$name:\t$name\n\$tld:\t$tld\n\n";
    }

    ... where you just read all your domain names into an array, and then simply process that array.
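If the set of TLDs grows, the if/elsif chain above can be driven by a list instead. This is only a sketch under the same assumption that every TLD of interest is known in advance; the names `@tlds`, `$tld_re` and `name_and_tld` are made up for the example.

```perl
use strict;
use warnings;

# The suffixes we expect to see (taken from the examples in this thread).
my @tlds = ( 'co.uk', 'ws.ru', 'com' );

# Build one pattern from the list. quotemeta escapes the dots, and the
# non-greedy (.+?) keeps the name as short as possible, so the longest
# listed suffix wins when more than one could match.
my $alternation = join '|', map { quotemeta($_) } @tlds;
my $tld_re      = qr/^(.+?)\.($alternation)$/i;

# Returns (name, tld), or an empty list when no listed TLD matches.
sub name_and_tld {
    my ($host) = @_;
    return $host =~ $tld_re ? ( $1, $2 ) : ();
}

for my $host ( 'www.drane.fresel.co.uk', 'www.drane2.ws.ru', 'www.drane3.Intern.com' ) {
    my ( $name, $tld ) = name_and_tld($host);
    print "\$name:\t$name\n\$tld:\t$tld\n\n" if defined $tld;
}
```

Adding another TLD then means adding one string to `@tlds` rather than another elsif branch.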

Re: Heuristic for parsing Host name and domain
by locked_user sundialsvc4 (Abbot) on Aug 31, 2011 at 01:12 UTC

    Also be mindful of the (very large...) set of patterns on CPAN under the Regexp::Common namespace.

    In general, “if you are looking for a regular expression that you suspect probably has been written before, it probably has.” And if you are looking for something that might need more than just a regular expression, i.e. programmed logic of some kind ... once again, CPAN probably already has it. Therefore, get into this habit (and I sorely wish I could take my own advice): look before you code.