robharper has asked for the wisdom of the Perl Monks concerning the following question:

I have spotted a few related nodes through searching, but can't really find what I'm looking for, so...

I would like to scan a set of files for web URLs and recover just the (fully qualified) domain names, preferably without host names attached. Finding the URLs is no major problem, but the next step troubles me. I realise that there is little or no consistency in how domains are arranged within ccTLDs, so some sort of database of rules would be needed to handle this fully.

Could someone please point me towards a module, program, or data set that might help me out here -- if such exists! If the worst comes, I could just strip the element before the first dot, which would probably do for most purposes, but is there a better way?

Replies are listed 'Best First'.
Re: Stripping domain names from URLs
by tachyon (Chancellor) on Sep 08, 2004 at 13:39 UTC

    There is consistency in how domains are arranged with respect to TLDs, ccTLDs, and SLDs, but it varies by country. I whipped up this module some time ago and expect it does what you want. It has all the TLDs and ccTLDs, and the SLDs for the major and easily accessible ccTLDs. Sorry, there are no docs, but it is pretty simple to RTFS. It is basically 400 lines of data with about 50 lines of code at the end. get_domain( URL, FLAG ) is probably what you want. See the source for what the flags do, but you need to pass either 1 or 2 to get the domain only or subdomain(s).domain respectively.
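    For readers without the module to hand, here is a rough, hypothetical sketch of the kind of heuristic a get_domain()-style routine could use. This is not tachyon's actual code, and the tiny %sld table stands in for his 400 lines of data:

    ```perl
    use strict;
    use warnings;

    # Illustrative only: a real table needs every SLD.ccTLD pair.
    my %sld = map { $_ => 1 } qw( com.au net.au co.uk org.uk co.nz co.jp );

    sub get_domain {
        my ( $url, $flag ) = @_;
        $flag ||= 1;
        my ($host) = $url =~ m{^(?:[a-z]+://)?([^/:?#]+)}i
            or return;
        my @labels = split /\./, lc $host;
        return if @labels < 2;
        # registered domain is 3 labels if the last two form a known
        # SLD.ccTLD (example.co.uk), otherwise 2 (example.com)
        my $take = $sld{ join '.', @labels[ -2, -1 ] } ? 3 : 2;
        $take = @labels if $take > @labels;
        # flag 1: domain only; flag 2: subdomain(s).domain (the full host)
        return $flag == 2 ? join( '.', @labels )
                          : join( '.', @labels[ -$take .. -1 ] );
    }

    print get_domain('http://www.example.co.uk/page'), "\n";  # example.co.uk
    ```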

      Thank you, tachyon, that looks very helpful.

      Thanks also to everyone else who commented -- I will take a closer look at URI::URL and at learning how to be more precise with defining my problems. :o)

Re: Stripping domain names from URLs
by dragonchild (Archbishop) on Sep 08, 2004 at 12:46 UTC
    It sounds like you're suffering from poor definitions. First, you need to define what constitutes a domain name. Then, you need to define how to determine the exceptions. Then, and only then, can you code something reasonable to deal with it.

    What I would do is take a long look at how you, personally, parse a domain name. That will give you the definition and exceptions. And, no, it's not s/^[^.]+\.//, in case you're wondering.

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

    I shouldn't have to say this, but any code, unless otherwise stated, is untested

      Actually it is a reasonable description of the problem. Domain names exist in 3 basic forms:

      domain.TLD       where TLD (Top Level Domain) is com, net, org etc
      domain.ccTLD     where ccTLD (Country Code TLD) is say .de
      domain.SLD.ccTLD where SLD.ccTLD (Second Level Domain ccTLD) is com.au, co.uk, etc

      # additionally we may have as a prefix
      www.domain....
      subdomain.domain.....
      subsubdomain.subdomain.domain.....

      So the task is how to extract the DOMAIN component +/- subdomains. The issue is that if you split on the . then the DOMAIN component has an index of either 0,1,2...N if you look from the left, or -3 or -2 if you look from the right. See my post below for a heuristic that deals with most of the weirdness. The most annoying feature is that domain.ccTLD exists. If not, all you would need to do is pop 1 element off if it was a TLD and 2 if it was a ccTLD. Alas, that would be too easy!
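      The pop-from-the-right heuristic described above can be sketched like this; the names and the tiny lookup tables are illustrative, not taken from the actual module:

      ```perl
      use strict;
      use warnings;

      # Illustrative tables only; a real implementation needs the full lists.
      my %cctld = map { $_ => 1 } qw( au uk nz de jp );
      my %sld   = map { $_ => 1 } qw( com co net org ac gov );

      sub domain_of {
          my @labels = split /\./, lc shift;
          my $keep = 2;    # domain.TLD or domain.ccTLD by default
          if ( $cctld{ $labels[-1] } && @labels >= 3 && $sld{ $labels[-2] } ) {
              $keep = 3;   # domain.SLD.ccTLD such as example.co.uk
          }
          return join '.', @labels[ -$keep .. -1 ];
      }
      ```

      The annoyance tachyon mentions shows up in the `if`: seeing a ccTLD on the right is not enough on its own, because both example.de (keep 2 labels) and example.co.uk (keep 3) end in a ccTLD.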

      cheers

      tachyon

        I understand that there is a precise problem statement for the issue the OP was dealing with, as you have stated. What I was getting at was that the OP hadn't found that precise statement.

        This is actually a more fundamental concept issue with our field in general - most programmers don't realize that development is at least 80% thinking and no more than 20% typing. I was reading some onLamp.com articles on FreeBSD installations and one author kept saying that installs were "99% preparation and 1% installation". I think that this is extremely appropriate to programming and computers in general. Most programming is actually quite easy - even trivial ... once you have figured out the problem you're trying to solve.


Re: Stripping domain names from URLs
by Fletch (Bishop) on Sep 08, 2004 at 12:51 UTC

    Your problem statement's still a little vague, and you'd hit snags with people who have an A record for their bare domain and/or international domains (e.g. "foo.com" vs "foo.co.uk"). You might get close by using URI::URL to parse them and then trying to look up an NS record with Net::DNS (stripping components and stopping when you hit a "top level" server such as "com" or "co.uk").

Re: Stripping domain names from URLs
by Steve_p (Priest) on Sep 08, 2004 at 12:47 UTC
    Have you looked into URI?
Re: Stripping domain names from URLs
by wfsp (Abbot) on Sep 08, 2004 at 12:55 UTC
    Have you looked at URI that comes with Perl? Does this help?
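    For the parsing half of the problem, pulling the host out of a URL with the URI module looks like this; deciding how much of that host is the registrable domain is still up to you:

    ```perl
    use strict;
    use warnings;
    use URI;    # CPAN module, bundled with libwww-perl

    my $uri = URI->new('http://www.example.co.uk/some/page?q=1');
    print $uri->scheme, "\n";   # http
    print $uri->host,   "\n";   # www.example.co.uk
    print $uri->path,   "\n";   # /some/page
    ```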