robharper has asked for the wisdom of the Perl Monks concerning the following question:

I have spotted a few related nodes through searching, but can't really find what I'm looking for, so...

I would like to scan a set of files for web URLs and recover just the (fully qualified) domain names, preferably without host names attached. Finding the URLs is no major problem, but the next step troubles me. I realise that there is little or no consistency in how domains are arranged within ccTLDs, so some sort of database of rules would be needed to handle this fully.

Could someone please point me towards a module, program, or data set that might help me out here -- if such exists! If the worst comes, I could just strip the element before the first dot, which would probably do for most purposes, but is there a better way?

Replies are listed 'Best First'.
Re: Stripping domain names from URLs
by tachyon (Chancellor) on Sep 08, 2004 at 13:39 UTC

    There is consistency in how domains are arranged with respect to TLDs, ccTLDs, and SLDs, but it varies by country. I whipped up this module some time ago and expect it does what you want. It has all the TLDs and ccTLDs, and the SLDs for the major and easily accessible ccTLDs. Sorry, there are no docs, but it is pretty simple to RTFS. It is basically 400 lines of data with about 50 lines of code at the end. get_domain( URL, FLAG ) is probably what you want. See the source for what the flags do, but you need to pass either 1 or 2 to get the domain only or subdomain(s).domain respectively.
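    For readers without the module to hand, here is a rough, hypothetical sketch of the kind of heuristic a get_domain()-style routine could use. This is not tachyon's actual code, and the tiny %sld table stands in for his 400 lines of data:

    ```perl
    use strict;
    use warnings;

    # Illustrative only: a real table needs every SLD.ccTLD pair.
    my %sld = map { $_ => 1 } qw( com.au net.au co.uk org.uk co.nz co.jp );

    sub get_domain {
        my ( $url, $flag ) = @_;
        $flag ||= 1;
        my ($host) = $url =~ m{^(?:[a-z]+://)?([^/:?#]+)}i
            or return;
        my @labels = split /\./, lc $host;
        return if @labels < 2;
        # registered domain is 3 labels if the last two form a known
        # SLD.ccTLD (example.co.uk), otherwise 2 (example.com)
        my $take = $sld{ join '.', @labels[ -2, -1 ] } ? 3 : 2;
        $take = @labels if $take > @labels;
        # flag 1: domain only; flag 2: subdomain(s).domain (the full host)
        return $flag == 2 ? join( '.', @labels )
                          : join( '.', @labels[ -$take .. -1 ] );
    }

    print get_domain('http://www.example.co.uk/page'), "\n";  # example.co.uk
    ```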

      Thank you, tachyon, that looks very helpful.

      Thanks also to everyone else who commented -- I will take a closer look at URI::URL and at learning how to be more precise with defining my problems. :o)

Re: Stripping domain names from URLs
by dragonchild (Archbishop) on Sep 08, 2004 at 12:46 UTC
    It sounds like you're suffering from poor definitions. First, you need to define what constitutes a domain name. Then, you need to define how to determine the exceptions. Then, and only then, can you code something reasonable to deal with it.

    What I would do is take a long look at how you, personally, parse a domain name. That will give you the definition and exceptions. And, no, it's not s/^[^.]+\.//, in case you're wondering.

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

    I shouldn't have to say this, but any code, unless otherwise stated, is untested

      Actually it is a reasonable description of the problem. Domain names exist in 3 basic forms:

      domain.TLD       where TLD (Top Level Domain) is com, net, org etc
      domain.ccTLD     where ccTLD (Country Code TLD) is say .de
      domain.SLD.ccTLD where SLD.ccTLD (Second Level Domain ccTLD) is com.au, co.uk, etc

      # additionally we may have as a prefix
      www.domain....
      subdomain.domain.....
      subsubdomain.subdomain.domain.....

      So the task is how to extract the DOMAIN component +/- subdomains. The issue is that if you split on the . then the DOMAIN component has an index of either 0,1,2...N if you look from the left, or -3 or -2 if you look from the right. See my post below for a heuristic that deals with most of the weirdness. The most annoying feature is that domain.ccTLD exists. If not, all you would need to do is pop 1 element off if it was a TLD and 2 if it was a ccTLD. Alas, that would be too easy!
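      The pop-from-the-right heuristic described above can be sketched like this; the names and the tiny lookup tables are illustrative, not taken from the actual module:

      ```perl
      use strict;
      use warnings;

      # Illustrative tables only; a real implementation needs the full lists.
      my %cctld = map { $_ => 1 } qw( au uk nz de jp );
      my %sld   = map { $_ => 1 } qw( com co net org ac gov );

      sub domain_of {
          my @labels = split /\./, lc shift;
          my $keep = 2;    # domain.TLD or domain.ccTLD by default
          if ( $cctld{ $labels[-1] } && @labels >= 3 && $sld{ $labels[-2] } ) {
              $keep = 3;   # domain.SLD.ccTLD such as example.co.uk
          }
          return join '.', @labels[ -$keep .. -1 ];
      }
      ```

      The annoyance tachyon mentions shows up in the `if`: seeing a ccTLD on the right is not enough on its own, because both example.de (keep 2 labels) and example.co.uk (keep 3) end in a ccTLD.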

      cheers

      tachyon

        I understand that there is a precise problem statement for the issue the OP was dealing with, as you have stated. What I was getting at was that the OP hadn't found that precise statement.

        This is actually a more fundamental concept issue with our field in general - most programmers don't realize that development is at least 80% thinking and no more than 20% typing. I was reading some onLamp.com articles on FreeBSD installations and one author kept saying that installs were "99% preparation and 1% installation". I think that this is extremely appropriate to programming and computers in general. Most programming is actually quite easy - even trivial ... once you have figured out the problem you're trying to solve.


Re: Stripping domain names from URLs
by Fletch (Bishop) on Sep 08, 2004 at 12:51 UTC

    Your problem statement's still a little vague, and you'd hit snags with people who have an A record for their bare domain and/or international domains (e.g. "foo.com" vs "foo.co.uk"). You might get close by using URI::URL to parse them and then trying to look up an NS record with Net::DNS (stripping components and stopping when you hit a "top level" server such as "com" or "co.uk").

Re: Stripping domain names from URLs
by Steve_p (Priest) on Sep 08, 2004 at 12:47 UTC
    Have you looked into URI?
Re: Stripping domain names from URLs
by wfsp (Abbot) on Sep 08, 2004 at 12:55 UTC
    Have you looked at URI that comes with Perl? Does this help?
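    For the parsing half of the problem, pulling the host out of a URL with the URI module looks like this; deciding how much of that host is the registrable domain is still up to you:

    ```perl
    use strict;
    use warnings;
    use URI;    # CPAN module, bundled with libwww-perl

    my $uri = URI->new('http://www.example.co.uk/some/page?q=1');
    print $uri->scheme, "\n";   # http
    print $uri->host,   "\n";   # www.example.co.uk
    print $uri->path,   "\n";   # /some/page
    ```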