hrcerq has asked for the wisdom of the Perl Monks concerning the following question:

Hello again.

I always used naive regexps for hostname validation. But recently I've been trying to build something more robust and more adherent to related RFCs.

Mostly, I've consulted the following RFCs:

From that I understand that:

If the hostname is qualified (i.e. there are at least 2 labels), then:

BTW, consulting RFCs sometimes feels like walking a complex maze full of hidden traps, because there's always some obscure detail you might overlook.

Things get worse if we consider some hostnames in the wild not adherent to these rules (e.g. some use underscores, which is valid for DNS, but not when used in hostnames), and also that there exist internationalized domain names.

I've tested my regex, but chances are there are corner cases I'm not aware of, so maybe you can help me find some.

This is how I'm doing:

my $hname_re = qr/
    ^
    (?=(?&validchar){1,255}$)
    (?!\d+$)
    (?&label)
    (?:
        (?:\.(?&label))*
        \.(?&tld)
        \.?
    )?
    $

    (?(DEFINE)
        (?<validchar>[a-zA-Z0-9.-])
        (?<alnum>[a-zA-Z0-9])
        (?<alnumdash>[a-z-A-Z0-9])
        (?<label>(?> (?&alnum) (?: (?&alnumdash){,61} (?&alnum) )? ) )
        (?<tld>(?!(\d+|.)\.?$) (?&label) )
    )
/x;

Thanks for any suggestions.

return on_success() or die;

Replies are listed 'Best First'.
Re: Regex for hostname validation
by Discipulus (Canon) on May 02, 2025 at 08:05 UTC
    hello hrcerq,

    sorry not to be the right guy to review your regex, but I'd suggest a totally different approach.

    As you have already extracted the rules from the fun plethora of RFCs and, as you said, there are many complications and dark corners, I'd prefer a dedicated subroutine: more verbose, but easier to implement and to extend. In an ideal world I'd prefer a dedicated module, and yes, Perl's ecosystem is nearly an ideal world: we have Data::Validate::Domain. But as I read on your home node, "I like to take the most out of core Perl before resorting to external modules", so let's assume you want your own solution. Still, look carefully at Data-Validate-Domain.t and also at Regexp-Common's plethora of RFC utilities for URIs...

    To be precise, you must be more explicit about what you want to validate: hostnames to browse or hostnames for a DNS entry? In fact you mention internationalised hostnames, but for the former you have Unicode and for the latter ACE strings.

    ..but going on your own, and assuming you are not looking for DNS entries, I'd go with something like

    sub validate_hostname {
        my $candidate = shift;
        my ($ascii_only, $verbose, $debug) = @_;    # leave room for improvements and flexibility
        my ($return, $descr);

        # non ASCII
        if ( $candidate =~ m/[^[:ascii:]]/ ) {      # but see: https://perlmonks.org/?node_id=11164574
            print "Not ASCII\n" if $verbose;
            # ..accepted?
            if ($ascii_only) {
                return wantarray ? (undef, "non ASCII string [$candidate] rejected") : undef;
            }
            # go with another specialized sub..
            # (return its result, otherwise we fall through to the ASCII checks)
            return validate_hostname_Unicode($candidate, $ascii_only, $verbose, $debug);
        }

        # ASCII
        # too long..
        if (length($candidate) >= 255) {
            # parenthesize length(): "length $candidate . '...'" would measure the concatenation
            $descr = "[$candidate] is too long (" . length($candidate) . " chars)";
            print $descr if $verbose;
            return wantarray ? (undef, $descr) : undef;
        }

        # Hostnames might be composed of 1 or more labels (separated by dots)
        unless ($candidate =~ /\./) {
            $descr = "[$candidate] contains no dots";
            print $descr if $verbose;
            return wantarray ? (undef, $descr) : undef;
        }
        # .. more checks for this rule

        # Each label may have at most 63 characters
        # ...

        #..have fun :)
    }
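    The per-label checks left open above could be sketched roughly as follows. The helper names validate_labels and _fail are hypothetical, and the rules encoded here (letters, digits and dashes only, no leading or trailing dash, 1 to 63 characters per label, one optional trailing dot) are my reading of the usual hostname rules, not something spelled out in this thread.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: checks every label of an ASCII-only candidate.
    # Returns 1 (or (1, 'ok') in list context) on success,
    # undef (or (undef, $reason) in list context) on failure.
    sub validate_labels {
        my ($candidate) = @_;

        # Assumption: a single trailing dot (root label) is tolerated.
        (my $name = $candidate) =~ s/\.\z//;

        # Limit -1 keeps trailing empty fields, so "a..b" and "a.b.." are caught.
        my @labels = split /\./, $name, -1;
        return _fail("[$candidate] has no labels") unless @labels;

        for my $label (@labels) {
            my $len = length $label;
            return _fail("[$candidate] has an empty label") if $len == 0;
            return _fail("label [$label] is too long ($len chars)") if $len > 63;
            return _fail("label [$label] has invalid characters")
                if $label =~ /[^A-Za-z0-9-]/;
            return _fail("label [$label] starts or ends with a dash")
                if $label =~ /\A-|-\z/;
        }
        return wantarray ? (1, 'ok') : 1;
    }

    # "return _fail(...)" passes the caller's context through, so wantarray works here.
    sub _fail { my ($why) = @_; return wantarray ? (undef, $why) : undef }
    ```

    Reporting the first failing rule, rather than a bare true/false, is what makes this approach easier to debug and extend than a single regex.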

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

      Thanks for your thoughtful reply.

      ... but look carefully at Data-Validate-Domain.t

      I will. I'm not against using external modules, it's just that I like to keep things minimal. So unless I'd have to reimplement an entire module and doing so is not trivial, I'd rather "reinvent the wheel", as core Perl is already a very nice toolbox and we can accomplish a lot with it.

      And yes, looking at the module tests will help a lot, thank you for the suggestion.

      ... hostnames to browse or hostnames for a DNS entry? In fact you mention internationalised hostnames ...

      I mistakenly mentioned IDNs just to make a point about how the RFCs relate to each other and the complexities involved, but I don't really have to cope with Unicode here, because I'm dealing with hosts file entries, so even if they used IDNs, they'd already be Punycode-encoded.

      Yet your suggestion was valuable, because if I wanted to validate names before they're encoded, I might do something like that. It also reminded me that I might (and should) use the /a option here.
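      A small illustration of why /a matters for a pattern that uses \d (as the (?!\d+$) lookahead in my regex does). This is only a sketch and needs Perl 5.14 or later; the variable names are made up for the example.

      ```perl
      use strict;
      use warnings;

      # U+0663 is ARABIC-INDIC DIGIT THREE: a Unicode decimal digit, but not ASCII.
      my $arabic_three = "\x{0663}";

      # Without /a, \d follows Unicode rules and matches any decimal digit.
      my $unicode_rules = $arabic_three =~ /\A\d\z/  ? 1 : 0;

      # With /a, \d is restricted to [0-9].
      my $ascii_rules   = $arabic_three =~ /\A\d\z/a ? 1 : 0;

      print "without /a: $unicode_rules, with /a: $ascii_rules\n";
      # prints "without /a: 1, with /a: 0"
      ```

      Explicit classes such as [a-zA-Z0-9] are unaffected either way; only \d, \w, \s and the POSIX classes change meaning under /a.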

      I just wonder which approach would have less impact on performance, considering I might have to validate many names at once. Guess I'll have to test it.
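      One way to test it is the core Benchmark module. The sketch below compares a single-regex check with a split-based one; both by_regex and by_split are simplified stand-ins written for this comparison (assumptions, not the exact code from this thread), and the sample names are invented.

      ```perl
      use strict;
      use warnings;
      use Benchmark qw(cmpthese);

      # Simplified stand-in patterns: LDH labels of 1..63 chars, name of 1..253 chars.
      my $label_re = qr/[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?/;
      my $name_re  = qr/\A(?=.{1,253}\z)$label_re(?:\.$label_re)*\z/;

      sub by_regex { return $_[0] =~ $name_re ? 1 : 0 }

      sub by_split {
          my ($name) = @_;
          return 0 if $name eq '' or length($name) > 253;
          for my $label (split /\./, $name, -1) {
              return 0 unless $label =~ /\A[a-zA-Z0-9-]{1,63}\z/;
              return 0 if $label =~ /\A-|-\z/;
          }
          return 1;
      }

      # Invented sample data: alternating valid and invalid names.
      my @names = map { ("host$_.example.com", "-bad$_.example..com") } 1 .. 1000;

      # A negative count means "run each candidate for at least that many CPU seconds".
      cmpthese(-1, {
          regex => sub { by_regex($_) for @names },
          split => sub { by_split($_) for @names },
      });
      ```

      The numbers will depend on the Perl version and on the mix of valid and invalid names, so it's worth feeding it a sample of the real hosts file rather than synthetic names.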

Re: Regex for hostname validation
by Fletch (Bishop) on May 02, 2025 at 11:38 UTC

    Seconding the other reply, this feels like (almost) an X/Y problem. On a first skim of the title and first sentence, my off-the-cuff instinct was "Just try to resolve the name to an IP with Net::DNS, or gethostbyname, and let your libc handle things" and be done with it. I can see several cases where some of your constraints, while "RFC legal", still wouldn't apply (e.g. I've used a custom non-standard internal TLD for local names that was valid in the context I used it; (internal) DNS certainly would have resolved it, but it would have failed bullet 2).

    The cake is a lie.

      You're right, I've not been very specific here. In my defense, that was a bit on purpose, because I think we miss some opportunities when we rush to a solution that just works instead of paying attention to why some approach is not good enough.

      But again, I recognize that giving fewer details than necessary created some doubts. So let me explain what I need here. I'm managing a hosts file that must be periodically updated to filter many domain/hostnames known to be used for ads, trackers, annoyances and malware (by associating them with 0.0.0.0, which somewhat protects the configured machine from accidentally requesting anything from them).

      As you might guess, this file gets very big, and I'd like to filter out any record that's invalid anyway, since it's pointless to add it to the hosts file. Sure, keeping only the legal names is not good enough for this purpose, but I intend to add warnings on output for those that are not legal.
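      Roughly, the filtering step I have in mind looks like the sketch below. The names looks_valid and filter_hosts are hypothetical, the validity check is a simplified placeholder for whichever validator I end up with, and the sample lines are invented.

      ```perl
      use strict;
      use warnings;

      # Placeholder check (assumption): LDH labels of 1..63 chars, name of 1..253 chars.
      my $label = qr/[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?/;
      sub looks_valid { return $_[0] =~ /\A(?=.{1,253}\z)$label(?:\.$label)*\z/ ? 1 : 0 }

      # Splits hosts-file lines into records worth keeping and rejected names.
      sub filter_hosts {
          my (@lines) = @_;
          my (@keep, @rejected);
          for my $line (@lines) {
              (my $record = $line) =~ s/\s*#.*//;         # drop comments
              my ($addr, @names) = split ' ', $record;
              next unless defined $addr and @names;       # blank or address-only line
              my @good = grep { looks_valid($_) } @names;
              push @rejected, grep { !looks_valid($_) } @names;
              push @keep, join ' ', $addr, @good if @good;
          }
          return (\@keep, \@rejected);
      }

      my ($keep, $rejected) = filter_hosts(
          '0.0.0.0 ads.example.com tracker_1.example.net  # sample',
          '0.0.0.0 -broken-.example.org',
          '# comment only',
      );
      warn "skipping invalid hostname: $_\n" for @$rejected;
      print "$_\n" for @$keep;
      ```

      With these sample lines only "0.0.0.0 ads.example.com" is kept; the underscore name and the dash-delimited label are reported on STDERR instead of being written out.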

      Your suggestion to leverage libc to resolve it is nice, because then I know if it'll be resolved or not, which for this purpose is very important. On the other hand, considering these are hosts known to serve bad things, I'd rather avoid the queries, even if I'm not reaching these machines themselves.
