Why not use URI? It will tell you what you need to know:
use URI;
my $uri = URI->new("http://www.mydomain.com.au/test.html");
print $uri->host, "\n";
It's probably far better to use that than a regexp, though,
of course, it cannot tell you whether any given
Top Level Domain (TLD) is actually valid.
It used to be that top level domains (TLDs) were for countries
(e.g. ".ca", ".au", ".uk") or types of organizations (".com",
".net", ".org", or even ".mil") and were fairly predictable.
These days, with countries being invaded and assimilated,
or splitting up because of civil war (frighteningly
frequent in Eastern Europe), the TLDs are always changing.
The ISO 3166 standard
specifies the country codes for various national entities.
Added to this are the likes of Esther Dyson, chairperson of
ICANN, which is proposing
to add things like ".museum" to the TLD namespace.
Don't forget the UTF-5 encoded "iTLDs" which are being
issued by VeriSign
and others. These look really wacky unless your browser
supports them, but they mean things like ".com" in Japanese,
Chinese, Korean, and other languages that aren't based on
the Latin character set.
The only way to know for sure, if only for a short period of
time (say, a month or so) before requiring an update, is to
process the root zone file, which lists the name servers
for all the active top level domains.
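As a rough sketch of what "processing the root zone file" could look like, the snippet below pulls the distinct TLDs out of zone-style text by looking at NS records. It assumes every record sits on one line in the usual "owner TTL class type data" layout; the function name tlds_from_zone, the ns1/ns2.example.net servers and the sample excerpt are all made up for illustration, and a real zone file would be read from disk rather than from a here-document.

```perl
use strict;
use warnings;

# Collect the distinct TLDs that have NS records in root-zone-style text.
# Hypothetical helper: assumes one complete record per line.
sub tlds_from_zone {
    my ($zone_text) = @_;
    my %seen;
    for my $line (split /\n/, $zone_text) {
        next if $line =~ /^\s*(?:;|$)/;            # skip comments and blank lines
        my ($owner, $ttl, $class, $type) = split ' ', $line;
        next unless defined $type;
        next unless uc($class) eq 'IN' && uc($type) eq 'NS';
        (my $name = lc($owner)) =~ s/\.$//;        # drop the trailing root dot
        next if $name eq '' || $name =~ /\./;      # keep single-label names only
        $seen{$name} = 1;
    }
    return sort keys %seen;
}

# Made-up excerpt; a real root zone is far longer.
my $sample = <<'ZONE';
; sample data for illustration only
au.     172800  IN  NS  ns1.example.net.
au.     172800  IN  NS  ns2.example.net.
com.    172800  IN  NS  ns1.example.net.
uk.     172800  IN  NS  ns1.example.net.
ns1.example.net. 172800 IN A 192.0.2.1
ZONE

print join(' ', tlds_from_zone($sample)), "\n";    # prints "au com uk"
```

The resulting list could then be used to check the last label of whatever $uri->host returns, bearing in mind that it goes stale as soon as the zone changes.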
tadman is right about using modules, but just thinking about the regexp, I think ^(http://[^/]*)
would do what you want, working from your example above:
my $foo = "http://www.hostname.co.uk/foo/bar";
print $foo =~ m#^(http://[^/]*)#;
This outputs http://www.hostname.co.uk, and it should work OK with IP addresses, the http://username:password@hostname/ format, etc.
andy.
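To see how that pattern behaves on a few different inputs, here is a small sketch. The helper name url_prefix and the test URLs are invented for the example; only the regexp itself comes from the post.

```perl
use strict;
use warnings;

# Return the "http://host" prefix of a URL, or undef if it does not match.
# url_prefix is a hypothetical helper wrapped around the regexp above.
sub url_prefix {
    my ($url) = @_;
    my ($prefix) = $url =~ m#^(http://[^/]*)#;
    return $prefix;
}

for my $url (
    'http://www.hostname.co.uk/foo/bar',
    'http://192.0.2.10/index.html',
    'http://username:password@hostname/private/',
    'https://www.hostname.co.uk/foo/bar',
) {
    my $prefix = url_prefix($url);
    print defined($prefix) ? $prefix : '(no match)', "\n";
}
```

The first three print everything up to the first slash after the host, including the username:password@ part in the third case. The last one prints "(no match)", because the pattern only allows plain http.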
To be more thorough, perhaps:
^(https?://[^/]*)
Or further:
^((?:https?|mailto)://[^/]*)
You should also hope that your 'username' and 'password' do
not contain any slashes. The only restriction would appear
to be that the username cannot contain a ':', and the password
cannot contain an '@', though this could be browser dependent.
ooo, you're quite right - https completely slipped my mind.
I'd have to agree with
^(https?://[^/]*)
But frankly, if you're going to include mailto, I think by rights all the other (multifarious) possibilities ought to match as well... in which case it really is time to reach for a module, as you initially suggested.
I disagree with you about possible slashes, at signs and other funny characters in the username and password, though. My (cursory) examination of the RFCs indicates they're both 'unsafe' and 'reserved', and RFC 1738 says: Within the user and password field, any ":", "@", or "/" must be encoded. I'm not sure this is still the current RFC, though. It does seem to make sense, given that the slash is a delimiter within the URL.
andy.
Looking into it further... RFC 1738 has since been updated by RFC 2396... but I need to go and do some Real Work... ;)
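For what it's worth, here is a minimal sketch of the encoding that RFC 1738 sentence asks for. The helper name encode_userinfo_field and the sample credentials are invented; it only escapes ":", "@", "/" and "%" itself, whereas a real application would probably want a module such as URI::Escape to handle every unsafe character.

```perl
use strict;
use warnings;

# Percent-encode ":", "@", "/" (and "%" itself) in one user or password field.
# Hypothetical helper; it deliberately handles only these four characters.
sub encode_userinfo_field {
    my ($text) = @_;
    $text =~ s{([:\@/%])}{sprintf('%%%02X', ord($1))}ge;
    return $text;
}

my $user = encode_userinfo_field('andy');
my $pass = encode_userinfo_field('p@ss/w:rd');
my $url  = "http://${user}:${pass}\@hostname/foo/bar";

print "$url\n";                            # http://andy:p%40ss%2Fw%3Ard@hostname/foo/bar
print $url =~ m#^(https?://[^/]*)#, "\n";  # http://andy:p%40ss%2Fw%3Ard@hostname
```

Once the fields are encoded this way, the first unencoded slash after the scheme really is the end of the host part, so [^/]* stops in the right place.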
Whoa... sorry, I'm a newbie to these parts...
make that regex: [a-z0-9.\/:]+(\.com|\.net|\.co)+[.A-Za-z0-9]*
Couldn't you do something along the lines of
(\.\w\w\w?)+
to make it match any run of two or three word characters beginning with a dot? I would think this solution would take care of most of the problems, though it is an ugly hack.
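A quick sketch of how that pattern behaves when applied to a bare host name. The helper name dotted_suffix and the sample hosts are made up, and two things are added that the suggestion did not include: a $ anchor, so the match has to sit at the end of the name, and an outer capture, so the whole run of groups is returned rather than just the last one.

```perl
use strict;
use warnings;

# Return the trailing run of ".xx" / ".xxx" groups in a host name, or undef.
# Hypothetical helper; the "$" anchor and outer capture are additions.
sub dotted_suffix {
    my ($host) = @_;
    my ($suffix) = $host =~ m/((?:\.\w\w\w?)+)$/;
    return $suffix;
}

for my $host ('www.hostname.co.uk', 'www.mydomain.com.au', 'www.abc.com', '192.0.2.10') {
    my $suffix = dotted_suffix($host);
    print "$host => ", defined($suffix) ? $suffix : '(no match)', "\n";
}
# www.hostname.co.uk => .co.uk
# www.mydomain.com.au => .com.au
# www.abc.com => .abc.com
# 192.0.2.10 => .10
```

The last two lines show why it is a hack: a short domain label is swallowed along with the TLD, and since \w also matches digits, the tail of an IP address looks like a TLD too.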