AddamB has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I'm looking for a more elegant alternative to: [a-z1-9.\/\:]+(.com|.net|.co)+[.A-Z1-9a-z]*

example: matches http://www.mydomain.com.au in http://www.mydomain.com.au/test.html

Can anyone suggest a way to replace that (.com|.net|.co) with an all-inclusive search?

:-) Much appreciated, Addam

Edit: chipmunk 2001-03-22

Re: regex to extract fully-qualified domain name from full URL
by tadman (Prior) on Mar 22, 2001 at 08:03 UTC
    Why not use URI? It will tell you what you need to know:
    use URI;
    my $uri = URI->new("http://www.mydomain.com.au/test.html");
    print $uri->host, "\n";
    It's probably far better to use that than a regexp, though of course it cannot tell you whether any given Top Level Domain (TLD) is actually valid.

    It used to be that top level domains (TLDs) were for countries (e.g. ".ca", ".au", ".uk") or types of companies (".com", ".net", ".org", or even ".mil") and were fairly predictable. These days, with countries being invaded and assimilated, or splitting up because of civil war (frighteningly frequent in Eastern Europe), the TLDs are always changing. ISO 3166 specifies the country codes for various national entities.

    Added to this are the likes of Esther Dyson, chairperson of ICANN, which is proposing to add things like ".museum" to the TLD namespace.

    Don't forget the UTF-5 encoded "iTLDs" which are being issued by VeriSign and others. These look really wacky unless your browser supports them, but they mean things like ".com" in Japanese, Chinese, Korean, and other languages that aren't based on the Latin character set.

    The only way to know for sure, if only for a short period of time (a month or so) before requiring an update, is to process the root zone file, which lists the name servers for every active TLD.
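As a rough sketch of that idea, the Perl below checks the last label of a host name against a list of known TLDs. The list is an assumption for illustration only: it holds just the TLDs named in this reply, whereas a real check would load the current list from the root zone data. The name `has_known_tld` is hypothetical.

```perl
use strict;
use warnings;

# Illustrative sample only: the TLDs mentioned in this reply.
# A real check would build this hash from up-to-date root zone data.
my %known_tld = map { $_ => 1 } qw(com net org mil ca au uk);

# Hypothetical helper: true (1) if the last label of $host is in the list.
sub has_known_tld {
    my ($host) = @_;

    # Capture the final dot-separated label, allowing a trailing dot.
    my ($tld) = lc($host) =~ /\.([a-z0-9-]+)\.?$/;

    return 0 unless defined $tld;
    return $known_tld{$tld} ? 1 : 0;
}

for my $host (qw(www.mydomain.com.au www.example.museum localhost)) {
    printf "%-22s %s\n", $host, has_known_tld($host) ? 'known TLD' : 'unknown TLD';
}
```

With this sample list, ".museum" is reported as unknown, which is exactly the maintenance problem described above: the list has to be refreshed whenever the TLD namespace changes.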

Re: regex to extract fully-qualified domain name from full URL
by andye (Curate) on Mar 22, 2001 at 15:40 UTC
    tadman is right about using modules, but just thinking about the regexp, I think
    ^(http://[^/]*)
    would do what you want, working from your above example...

    my $foo = "http://www.hostname.co.uk/foo/bar";
    print $foo =~ m#^(http://[^/]*)#;
    outputs http://www.hostname.co.uk - and it should work ok with IP addresses and the http://username:password@hostname/ format, etc

    andy.
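A minimal sketch of how that pattern behaves on a few inputs. Only the first URL comes from the thread; the others are made-up samples, and the helper name `http_prefix` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the pattern from the post above.
sub http_prefix {
    my ($url) = @_;
    my ($prefix) = $url =~ m#^(http://[^/]*)#;
    return $prefix;    # undef when the URL does not start with http://
}

# Sample inputs; only the first one appears in the thread.
for my $url (
    'http://www.hostname.co.uk/foo/bar',
    'http://192.168.0.1/index.html',
    'http://username:password@hostname/private/',
    'https://www.hostname.co.uk/foo/bar',
) {
    my $prefix = http_prefix($url);
    print "$url => ", ( defined $prefix ? $prefix : '(no match)' ), "\n";
}
```

The first three inputs yield everything up to the first slash after the scheme, including any username:password part. The last one yields no match, because the pattern as written only accepts the http scheme.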

      To be more thorough, perhaps:
      ^(https?://[^/]*)
      Or further:
      ^((?:https?|mailto)://[^/]*)
      You should also hope that your 'username' and 'password' do not contain any slashes. The only restriction would appear to be that the username cannot contain a ':', and the password cannot contain an '@', though this could be browser dependent.

        ooo, you're quite right - https completely slipped my mind.
        I'd have to agree with
        ^(https?://[^/]*)
        But frankly, if you're going to include mailto, I think by rights all the other (multifarious) possibilities ought to match as well... in which case it really is time to reach for a module, as you initially suggested.

        I disagree with you about possible slashes, at signs and other funny characters in the username and password, though. My (cursory) examination of the RFCs indicates they're both 'unsafe' and 'reserved', and RFC1738 says: "Within the user and password field, any ":", "@", or "/" must be encoded". I'm not sure this is still the current RFC, though (?). And this seems to make sense, given that the slash is a delimiter within the URL.

        andy.

        looking into it further... RFC1738 superseded by RFC2396... but I need to go and do some Real Work... ;)
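To illustrate why that encoding rule matters for the regex, here is a small sketch, assuming a made-up password that contains both "@" and "/". The helper names and sample values are hypothetical, and the encoder covers only the characters named in the quoted rule (plus "%" itself, added here so the result stays unambiguous); a complete encoder would handle more.

```perl
use strict;
use warnings;

# Percent-encode ":", "@" and "/" as the quoted rule requires.
# "%" is also encoded (an addition here) so the output can be decoded safely.
sub encode_userinfo_part {
    my ($text) = @_;
    $text =~ s{([:@/%])}{sprintf '%%%02X', ord $1}ge;
    return $text;
}

# The https-aware pattern discussed in this subthread.
sub url_prefix {
    my ($url) = @_;
    my ($prefix) = $url =~ m{^(https?://[^/]*)};
    return $prefix;
}

my $password = 'p@ss/word';    # hypothetical sample value

my $raw     = sprintf 'http://%s:%s@hostname/private/', 'andy', $password;
my $encoded = sprintf 'http://%s:%s@hostname/private/', 'andy',
    encode_userinfo_part($password);

print url_prefix($raw),     "\n";    # http://andy:p@ss
print url_prefix($encoded), "\n";    # http://andy:p%40ss%2Fword@hostname
```

With the raw password the match stops at the slash inside the password and never reaches the host. Once the password is encoded, the first literal slash after the scheme really is the start of the path, so the simple pattern holds up.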

Re: regex to extract fully-qualified domain name from full URL
by AddamB (Initiate) on Mar 22, 2001 at 07:56 UTC
    woah... sorry I'm a newbie to these parts... make that regex: [a-z1-9.\/\:]+(.com|.net|.co)+[.A-Z1-9a-z]*

Re: regex to extract fully-qualified domain name from full URL
by Wodin (Acolyte) on Mar 22, 2001 at 09:47 UTC
    Couldn't you do something along the lines of
    (\.\w\w\w?)+
    to make it match any run of groups consisting of a dot followed by two or three word characters? I would think this solution would take care of most of the problems, though it is an ugly hack.
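A quick sketch of how that pattern behaves on the URLs from this thread. An outer capture group is added here because a repeated group on its own keeps only its last repetition. The second helper, which requires the run to end at a slash or at the end of the string, is an assumption of this sketch and not part of the suggestion above; both helper names are hypothetical.

```perl
use strict;
use warnings;

# The suggested pattern, wrapped in an outer capture to keep the whole run.
sub dot_run {
    my ($text) = @_;
    my ($run) = $text =~ /((?:\.\w\w\w?)+)/;
    return $run;
}

# Assumed variant: the run must be followed by "/" or the end of the string.
sub dot_run_before_path {
    my ($text) = @_;
    my ($run) = $text =~ m{((?:\.\w\w\w?)+)(?=/|\z)};
    return $run;
}

my $url = 'http://www.mydomain.com.au/test.html';

print dot_run($url),             "\n";    # .myd
print dot_run_before_path($url), "\n";    # .com.au
print dot_run_before_path('http://www.hostname.co.uk/foo/bar'), "\n";    # .co.uk
```

Unanchored, the pattern stops at the first dot it can satisfy, which here is the start of "mydomain". Tying it to the end of the host part gets closer to the intent, though it still accepts any two- or three-character label, real TLD or not.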