Re: Regex for extracting a domain name from a string.

I don't have a ready-made solution for you. For this kind of problem, I usually can recommend some part of Regexp::Common, but it doesn't seem to have a module for this.

It sounds as if you want sub.company.domx.domy as well as just company.domx.domy where domx is optional, depending on what domy is.

I'm not sure what the allowable alphabet for domain parts is, but for now I'll say

$dom = qr{
          [a-z]       # starts with a letter
          [a-z0-9-]*  # zero or more letters, numbers, hyphens
         }xi;
[download]

From there, get a list of Internet top-level domains. Look through and see which of those you want an "extra" domain for. For au, for example, you want

$au = qr{ $dom \.   # company
          $dom \.   # subdomain
          au        # ccTLD
          \. ?      # optional trailing dot
          \b        # word break
         }xi;
[download]

But for the mighty .com, it's just

$com = qr{ $dom \.  # company
           com      # gTLD
           \. ?     # optional trailing dot
           \b       # word break
         }xi;
[download]

The word break at the end keeps us from matching i.am.a.silly.com.administrator.at.example.com as silly.com instead of example.com. This might not be the right thing for your input, though. It might be better to say (?! $dom) there instead (negative look ahead to confirm there's no more $dom stuff).

The optional trailing dot allows "example.com." as well as "example.com" (both are "legal").

Given these, you can make

my $domain = qr{ (?:        # start of group
                   $dom \.  # company subdomain
                 ) ?        # end of group, make it optional
                 (?:        # start of group
                   $com     # .com pattern
                   |        # or
                   $au      # .au pattern
                 )          # end of group
               }xi;
[download]

Having written all this and then done some testing, I find that this is harder than I thought (I should have known!). I'll stop now and provide what I got so far along with the tests that show it doesn't work. Expanding the testing to more interesting cases once you have a solution should be easy.

use strict;
use warnings;

use Test::More 'no_plan';

my $dom = qr{
              [a-z]       # starts with a letter
              [a-z0-9-]*  # zero or more letters, numbers, hyphens
            }xi;
my $au = qr{ $dom \.   # company
             $dom \.   # subdomain
             au        # ccTLD
             \. ?      # optional trailing dot
             \b        # word break
            }xi;
my $com = qr{ $dom \.  # company
              com      # gTLD
              \. ?     # optional trailing dot
              \b       # word break
             }xi;
my $domain = qr{
                 (?:         # start of group
                   $dom \.   # company subdomain
                 ) ?         # end of group, make it optional
                 (?:         # start of group
                   $com      # .com pattern
                   |         # or
                   $au       # .au pattern
                 )           # end of group
                }xi;

my @no_match = qw( illegal_underscore.com
                   example.unreal.dom
                   too-short.au
                   0digits-allowed.com
              );

foreach my $bad ( @no_match ) {
    unlike( $bad, $domain, "no match: '$bad'" );
}

my %match_for = ( 'foo.sub.example.com'    => 'sub.example.com',
                  'bar.sub.example.com.au' => 'sub.example.com.au',
             );

while ( my ( $stuff, $find ) = each %match_for ) {
    my $test_name = "match: '$stuff'";
    if ( $stuff =~ qr{ ( $domain ) }x ) {
        my $matched = $1;
        pass( $test_name );
        is( $matched, $find, "found: '$find'" );
    }
    else {
        fail( $test_name );
    }
}
[download]

One problem revealed by the tests is that the pattern will be happy to match merely example.com when faced with example.com.au. Also, it will match domains with illegal characters in them, basically by pretending that the name ends at the illegal character. For example illegal_underscore.com matches as underscore.com. Depending on your application, that may be acceptable, but I don't like it much.

Comment on Re: Regex for extracting a domain name from a string. Select or Download Code