My domain name stripper. It takes a full URL with GET variables, an email address, or a bare www.domain.com or domain.com address, and returns either fqdn.domain.com or domain.com, depending on the type requested. It is intended for use when running under taint mode.

$self->stripper('fqdn', $url);     # www.domain.com
$self->stripper('regdn', $url);    # domain.com

Note: This only works for single dot TLDs (.com, .net, .info, .ca, etc) and not intl TLDs like .co.uk etc with multiple dots.

sub stripper {
    my $type = shift;
    my $url  = shift;
    if ( $url =~ m/((?:(?:https?|ftp|irc):\/\/|(?:(?:(www)|(ftp))[\w-]*\.))?[-\w\/~\@:]+\.\S+[\w\/])/i ) {
        $url = "$1";
    }
    else {
        &err($url);
    }
    $url =~ s/^https?:\/\/|mailto:|(.*)\@//ig;    # get http(s)://, mailto:, and email@
    $url =~ s/\/.*//;                             # Strip out the / and everything after it
    $url =~ s/[\?\#\:].*//;                       # Get any GET vars
    my $fqdn   = $url;
    my @domain = split( /\./, $url );    # We have to do this backwards (com.domain.sub)
    my $tld    = pop(@domain);           # .com
    my $secld  = pop(@domain);           # .domain
    my @result = ( $secld, $tld );
    my $regdn  = join( "\.", @result );
    if ( $type eq "fqdn" ) {
        return $fqdn;
    }
    else {
        return $regdn;
    }
}

Re: URL, etc to Domain Name Stripper
by ikegami (Patriarch) on Dec 30, 2009 at 22:12 UTC

    To perform the advertised function, the sub should extract the host from a URL, determine whether the host is a domain, and strip the domain down to the company level.

    The first part can be done using

    use URI;
    defined( my $host = URI->new($url)->host() ) or die("Unable to determine the host of URL $url\n");

    Instead, the presented sub attempts to do the first two parts at the same time and does a bad job.

    • It finds a domain in some URLs that don't have a host.
    • It finds a domain in some URLs that don't have a domain for host.
    • It finds no domain in some URLs that do have a domain for host.

    As for the third step, the sub just guesses, as the OP admitted himself.

    The posted sub also handles errors oddly, but that's trivial to fix.

    "This only works for single dot TLDs (.com, .net, .info, .ca, etc) and not intl TLDs like .co.uk etc with multiple dots."

    Canada has .ca, .province.ca and .city.province.ca as suffixes, not just .ca. For example,

    • tkf.toronto.on.ca - Toronto Kite Fliers
    • senecac.on.ca - Seneca College
    • ttc.ca - Toronto Transit Commission

    company.ca used to only be available to federally incorporated institutions.
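
The Canadian examples show why popping the last two labels is not enough. Here is a minimal sketch of a suffix-aware alternative, assuming a tiny hand-picked suffix table. The name registrable_domain and the entries in %SUFFIX are illustrative only and are not from the posted code; a real table would be far longer and would come from a maintained public suffix list.

```perl
use strict;
use warnings;

# Hypothetical, deliberately tiny suffix table (an assumption for illustration).
my %SUFFIX = map { $_ => 1 } qw(com net info ca on.ca toronto.on.ca co.uk);

# Find the longest known suffix and return it with one more label in front.
# Returns undef (in scalar context) when no known suffix leaves a label over.
sub registrable_domain {
    my ($host) = @_;
    my @labels = split /\./, lc $host;

    # Walk from the longest candidate suffix to the shortest.
    for my $i ( 1 .. $#labels ) {
        my $suffix = join '.', @labels[ $i .. $#labels ];
        return join( '.', @labels[ $i - 1 .. $#labels ] ) if $SUFFIX{$suffix};
    }
    return;
}

print registrable_domain('www.tkf.toronto.on.ca'), "\n";    # tkf.toronto.on.ca
print registrable_domain('www.senecac.on.ca'),     "\n";    # senecac.on.ca
print registrable_domain('www.ttc.ca'),            "\n";    # ttc.ca
```

Trying the longest suffix first is what keeps senecac.on.ca from being cut down to on.ca.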

Re: URL, etc to Domain Name Stripper
by MidLifeXis (Monsignor) on Dec 30, 2009 at 22:20 UTC

    Given the clarification that this is in a taint-safe environment, I would be hesitant to use this.

    • It uses a hand-rolled URL parser
    • Example uses object method calling, code uses non-object parameter parsing (you are not collecting $self in the function)
    • &err call uses '&' to prefix the call. This should only be used in certain cases, and this is not one of them.
    • The patterns are complex to read and not easily verifiable.
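
The second point can be sketched with a toy class. This is an illustration only: the package name Demo and the subs broken and fixed are hypothetical, and neither sub does any real parsing. It only shows where the arguments land when a sub written for plain function calls is invoked as a method.

```perl
use strict;
use warnings;

{
    package Demo;    # hypothetical class, not from the original post

    sub new { return bless {}, shift }

    # Same argument handling as the posted sub: no $self is collected,
    # so a method call puts the object in $type and the type in $url.
    sub broken {
        my $type = shift;
        my $url  = shift;
        return $url;
    }

    # Collect the invocant first, then the real arguments.
    sub fixed {
        my ( $self, $type, $url ) = @_;
        return $url;
    }
}

my $obj = Demo->new;
print $obj->broken( 'fqdn', 'www.domain.com' ), "\n";    # fqdn
print $obj->fixed( 'fqdn', 'www.domain.com' ),  "\n";    # www.domain.com
```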

    Just my $0.02. With inflation, it is not worth much.

    --MidLifeXis

Re: URL, etc to Domain Name Stripper
by MidLifeXis (Monsignor) on Dec 30, 2009 at 21:32 UTC

    When you say "taint approved stripper", do you mean that it returns a hostname suitable for use in a tainted environment?

    --MidLifeXis

      Yes, sorry. This can be used when running with the -T argument.

        By your definition, the following is "taint approved" as well:

        my ($untainted) = $tainted =~ /^(.*)/s;

        Something that's safe for use under -T is something that's guaranteed to deliver exactly what it promises to deliver, and your code does not do that.

        print stripper('fqdn', 'www.a.;EVIL!/'); # www.a.;EVIL!

        EVIL! can't contain [\s/] which makes it impractical as an attack vector in most situations, but there's no way that what the sub returns should be considered safe.
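
The difference between laundering a value and validating it can be sketched as follows. This is a minimal example, assuming hostnames are restricted to ASCII letters, digits, and hyphens; untaint_host and $LABEL are hypothetical names, not part of the posted code.

```perl
use strict;
use warnings;

# One DNS label: alphanumeric at both ends, hyphens allowed inside, 63 chars max.
my $LABEL = qr/[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?/i;

# Return the captured host only when the whole string is a dotted sequence
# of at least two valid labels; otherwise return undef. Under -T, the value
# returned comes from a regex capture, which is how Perl untaints data.
sub untaint_host {
    my ($candidate) = @_;
    return undef unless defined($candidate) && length($candidate) <= 253;
    my ($host) = $candidate =~ /\A($LABEL(?:\.$LABEL)+)\z/;
    return $host;
}

print defined( untaint_host('www.a.;EVIL!') ) ? "accepted\n" : "rejected\n";    # rejected
print untaint_host('www.example.com'), "\n";                                    # www.example.com
```

Unlike /^(.*)/s, the pattern is anchored at both ends and lists exactly which characters are allowed, so anything unexpected is refused instead of passed through.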

Re: URL, etc to Domain Name Stripper
by jnbek (Scribe) on Dec 31, 2009 at 17:22 UTC

    Wow, I must say I wasn't expecting as many responses as I received. I wrote this code a couple of years back in response to one of my questions about the same thing. I don't remember the article now, but I am open to suggestions to improve this old hunk of code :)

    Thanks for all your honest replies too.