alpha-lemming has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I am trying to extract a list of URLs for all subdomains of the domain "foo.com" from a very messy DB dump. Here's my code and some sample data:

while (<DATA>) {
    my @urls = ( $_ =~ /(https?\:\/\/.*?\.foo\.com)/g);
    foreach my $url (@urls) {
        print "$url\n" if $1;
    }
}
__DATA__
http://www.foo.com/fishnuts
http://smtp.foo.com https://www.foo.com/?(bunch-of-stuff):{}https://svn.foo.com/docs
https://yahoo.de/?search:{width}-https://www.foo.com
https://google.com https://foo.com:(More-random-stuff)https://yahoo.de::http://pubdocs.foo.com/top/index.html

The desired result would be:

http://www.foo.com
http://smtp.foo.com
https://www.foo.com
https://svn.foo.com
https://www.foo.com
https://foo.com
http://pubdocs.foo.com
However, that regex does not work on the last two lines of data, because the match starts at the first "http" on the line rather than at the URL that actually ends in ".foo.com". I tried this:

/(https?\:\/\/(?!http.)*?\.foo\.com)/g

But that gives me a "matches null string many times in regex" warning.

Thanks for any help!

Re: Regex: Extract base URL for a specific domain
by hdb (Monsignor) on Aug 29, 2013 at 14:00 UTC

    This should work:

    my @urls = m|(https?://[^/]+[.]foo[.]com)|g;
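
    Dropped into the original loop, that would look roughly like this (untested sketch); because [^/]+ cannot cross a "/", a match can no longer start at an earlier URL on the line and run on through its path until it reaches ".foo.com":

    while (<DATA>) {
        # the list-context match against $_ returns every capture on the line
        my @urls = m|(https?://[^/]+[.]foo[.]com)|g;
        print "$_\n" for @urls;
    }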

      That works great, thanks!

Re: Regex: Extract base URL for a specific domain
by MidLifeXis (Monsignor) on Aug 29, 2013 at 13:47 UTC

    How about using URI, and matching against the end of the string returned by the $uri->host or $uri->ihost method?

    --MidLifeXis
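
    A minimal sketch of that idea, assuming the candidate URLs are first split out with a rough regex (URI only parses strings it is handed; it does not find URLs inside free text, and the splitting pattern here is my own guess at the dump's format):

    use strict;
    use warnings;
    use URI;

    while (<DATA>) {
        # crude candidate split: stop each candidate at the next "http(s)://"
        # so run-together URLs in the dump do not merge into one string
        for my $candidate ( m{( https?:// (?: (?! https?:// ) \S )+ )}gx ) {
            my $uri  = URI->new($candidate);
            my $host = eval { $uri->host } or next;     # skip anything unparsable
            print $uri->scheme, "://$host\n"
                if $host =~ /(?:\A|[.])foo[.]com\z/;    # foo.com or any subdomain
        }
    }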

Re: Regex: Extract base URL for a specific domain
by daxim (Curate) on Aug 29, 2013 at 14:01 UTC
    The output of the urifind filter has a large overlap with your desired result.

      Yeah, I did try URI::Find, but it was getting confused by some of the junk in the file and returning overlapping URIs.
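
      For the record, the module behind that filter is URI::Find, and a rough sketch with a foo.com host check bolted onto the callback might look like the following (the host check is my addition, not something URI::Find provides, and it shares whatever trouble URI::Find has with the run-together junk described above):

      use strict;
      use warnings;
      use URI::Find;

      my @found;
      my $finder = URI::Find->new(sub {
          my ($uri, $orig_text) = @_;       # URI object plus the matched text
          my $scheme = $uri->scheme || '';
          if ($scheme =~ /\Ahttps?\z/) {
              my $host = $uri->host;
              # keep only the base URL for foo.com and its subdomains
              push @found, "$scheme://$host"
                  if $host =~ /(?:\A|[.])foo[.]com\z/;
          }
          return $orig_text;                # put the matched text back unchanged
      });

      while (my $line = <DATA>) {
          $finder->find(\$line);            # invokes the callback for each URI found
      }
      print "$_\n" for @found;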