Creating a regex to match part of a URL's domain name (was: pattern matching)

strfry() has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: pattern matching by davorg (Chancellor) on Jun 06, 2001 at 17:45 UTC
This matches the longest string of word characters that is followed by '.com':

    my $domain;
    if ($str =~ /(\w+)\.com/) {
        $domain = $1;
    } else {
        # no match
        $domain = '';
    }

-- <http://www.dave.org.uk> "Perl makes the fun jobs fun and the boring jobs bearable" - me
Re: pattern matching by japhy (Canon) on Jun 06, 2001 at 17:55 UTC
Well, putting aside your desire to match the "important" part of a domain name, you probably want to use a regex that says "match and save a set of non-. characters that are followed by a ., then non-. and non-/ characters, and then either a / or the end of the string". Basically, you want to ensure you're a) only looking at the domain name, and b) getting the penultimate .-separated sequence. It would probably be more intuitive to use two `split()`s:

    ($domain) = split '/', $string;
    $wanted = (split /\./, $domain)[-2];
    # or
    $wanted = (split /\./, (split '/', $string)[0])[-2];

Here's the regex approach:

    ($wanted) = $string =~ m{
      ( [^.]+ )     # save the non-. sequence to $1
      \.            # .
      [^./]+        # the final non-. non-/ sequence
      (?: / | $)    # / or the end of the string
    }x;

`japhy` -- Perl and Regex Hacker
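A minimal sketch of the two approaches above, wrapped in helper subs so they can be compared side by side. The names `wanted_by_split` and `wanted_by_regex` are made up for this illustration, and the input is assumed to be a bare "host/path" string with no `http://` scheme in front (a scheme would confuse the first `split`).

```perl
use strict;
use warnings;

# Hypothetical helper: drop the path, then take the penultimate dot-separated label.
sub wanted_by_split {
    my ($string) = @_;
    my ($host)  = split '/', $string;    # everything before the first slash
    my @labels  = split /\./, $host;     # the host split on dots
    return @labels >= 2 ? $labels[-2] : undef;
}

# Hypothetical helper: the same idea as a single regex.
sub wanted_by_regex {
    my ($string) = @_;
    my ($wanted) = $string =~ m{
        ( [^.]+ )     # capture the non-. sequence
        \.            # a literal dot
        [^./]+        # the final non-. non-/ sequence
        (?: / | $)    # a slash or the end of the string
    }x;
    return $wanted;
}

for my $s ("www.google.com",
           "ww2.mirror.google.com/index.html",
           "www.google.somethingnotnormal") {
    printf "%-35s split=%s regex=%s\n",
        $s, wanted_by_split($s), wanted_by_regex($s);
}
```

Both subs return undef when there is no penultimate label to report, so a caller would want to check `defined` before printing the result.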
Re: Re: pattern matching by strfry() (Monk) on Jun 06, 2001 at 19:50 UTC
hmm i used part of your code in a subroutine, and it's giving me the error "Use of uninitialized value at ./index.cgi line 29." here's the function:

    sub getd {
        my $string = @_;
        my $wanted;
        my $domain;
        ($domain) = split '/', $string;
        $wanted = (split /\./, $domain)[-2];
        return $wanted;
    }
    my $variable = "www.google.com";
    print &getd($variable); # this is line 29.

any ideas? strfry()
Re: Re: Re: pattern matching by japhy (Canon) on Jun 06, 2001 at 19:55 UTC
You're not doing any rudimentary data-checking, or you'd see that `my $string = @_` was assigning a number to your variable (in scalar context, `@_` yields the number of arguments, here 1).

    # try one of these:
    my ($string) = @_;
    my $string = shift;
    my $string = $_[0];

`japhy` -- Perl and Regex Hacker
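To make the context problem concrete, here is a small sketch that runs the buggy and the fixed assignment side by side. The sub names `getd_buggy` and `getd_fixed` are invented for this comparison. With the scalar assignment, `$string` becomes 1, so there is only one dot-separated piece and the `[-2]` slice comes back undefined, which is where the "uninitialized value" warning comes from.

```perl
use strict;
use warnings;

# Buggy: a scalar assignment puts @_ in scalar context,
# so $string receives the argument count (1), not the argument.
sub getd_buggy {
    my $string = @_;
    my ($domain) = split '/', $string;
    my $wanted = (split /\./, $domain)[-2];
    return $wanted;
}

# Fixed: the parentheses make it a list assignment,
# so $string receives the first argument.
sub getd_fixed {
    my ($string) = @_;
    my ($domain) = split '/', $string;
    my $wanted = (split /\./, $domain)[-2];
    return $wanted;
}

my $buggy = getd_buggy("www.google.com");
my $fixed = getd_fixed("www.google.com");

my $shown = defined $buggy ? $buggy : "(undef)";
print "buggy: $shown\n";
print "fixed: $fixed\n";
```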
Re: Re: Re: Re: pattern matching by strfry() (Monk) on Jun 06, 2001 at 20:15 UTC
Re: Re: Re: Re: Re: pattern matching by japhy (Canon) on Jun 06, 2001 at 20:26 UTC
Re: Re: pattern matching by strfry() (Monk) on Jun 06, 2001 at 21:21 UTC
Re: Re: pattern matching by strfry() (Monk) on Jun 06, 2001 at 18:48 UTC
yes yes yes yes! that's it! thank you! (: now all i have to do is fiddle with it until i understand exactly what's taking place hehe gracias strfry()
Re: pattern matching by mirod (Canon) on Jun 06, 2001 at 17:47 UTC
you can use the `($result)= ($string=~ m/pattern/);` idiom this way:

    #!/usr/bin/perl -w
    use strict;

    while( my $string=<DATA>)
      { my( $domain)= ($string=~ m{(?:^|\.)   # the beginning of the string or .
                                   ([^.]*)    # anything but . (and store it in $1)
                                   \.com      # .com
                                   (?:\/|$)   # a / or the end of the string
                                  }x);
        print "domain: $domain\n";
      }

    __DATA__
    l-12345.in.some.domain.com
    l-12345.in.some.domain.com/blargh/index.html
    domain.com/blargh/index.html
    domain.com
    domain.com/
    l-12345.in.some.domain.com/blargh/foo.com
    nope.com.domain.com/blargh/foo.com
    l-12345.in.some.domain.com/blargh/nope.foo.com
Re: pattern matching by tachyon (Chancellor) on Jun 06, 2001 at 18:09 UTC
This will work, and it captures domains like foo-bar, which \w does not (\w does not match a hyphen). This regex looks to the left of the .com and stops grabbing chars at the first dot or forward slash. Also note that you assign to @i, which is an array, rather than $i, which is a scalar and what you probably had in mind.

tachyon

    my $i = "l-12345.in.some.domain.com/blargh/index.html";
    my ($result) = $i =~ m|([^./]+)\.com|;
    print $result;

    # if you want to allow several endings like .com .gov etc
    ($result) = $i =~ m#([^./]+)\.(?:com|org|gov|edu|etc)#;
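A quick sketch of the difference between the two character classes, using a made-up hyphenated host (`www.foo-bar.com` is only an illustration). Since `\w` stops at the hyphen, `(\w+)\.com` keeps only the part after it, while `[^./]+` keeps the whole label.

```perl
use strict;
use warnings;

my $host = "www.foo-bar.com/index.html";    # hypothetical example host

my ($with_w)     = $host =~ /(\w+)\.com/;       # \w does not match '-'
my ($with_class) = $host =~ m|([^./]+)\.com|;   # anything but '.' and '/'

print "with \\w+    : $with_w\n";
print "with [^./]+ : $with_class\n";
```

Neither pattern checks what follows `.com`, so both would also match a host whose next label merely begins with "com". Requiring a / or the end of the string afterwards is one way to tighten that.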
Re: Re: pattern matching by strfry() (Monk) on Jun 06, 2001 at 18:36 UTC
hmm but i want to be able to match it with anything...disregarding named compliancies (eg: `www.google.somethingnotnormal`) mainly because i'd like to know how, not because i need it (the regex you just showed me works perfectly, actually, and i'm really grateful and all... but i'm curious) (:
Re: pattern matching by shotgunefx (Parson) on Jun 06, 2001 at 17:52 UTC
The first problem is that you're assigning a scalar to an array. It should be `my $i = "l-12345.in.some.domain.com";`

Second, your question seems a bit vague. Do you know which two things you need to match? If so, you should be able to say `$i =~ /($first)\.($second)/;` Now $first is in $1 and $second is in $2 if they were found. Of course, this doesn't take boundaries into account: if $first = "dog" and $second = "com", this will match fogdog.com as well. You need to determine what your boundaries are going to be.

-Lee "To be civilized is to deny one's nature."
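Here is a sketch of the boundary problem described above. The `$strict` pattern is only one possible choice of boundaries and is an assumption made for this example, not something from the thread's original question: the first part must start the string or follow a dot, and the second part must be followed by a / or the end of the string.

```perl
use strict;
use warnings;

my ($first, $second) = ("dog", "com");

# Unanchored, as in the post: also matches inside "fogdog.com".
my $loose = qr/($first)\.($second)/;

# Assumed boundaries: start of string or a dot before $first,
# a slash or the end of the string after $second.
my $strict = qr{(?:^|\.)(\Q$first\E)\.(\Q$second\E)(?:/|$)};

for my $host ("dog.com", "www.dog.com/kennel", "fogdog.com") {
    my $l = $host =~ $loose  ? "match" : "no match";
    my $s = $host =~ $strict ? "match" : "no match";
    print "$host: loose=$l strict=$s\n";
}
```

The `\Q...\E` quoting keeps any regex metacharacters in the interpolated strings from being treated as pattern syntax.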
Re: Re: pattern matching by strfry() (Monk) on Jun 06, 2001 at 18:23 UTC
well, say i have www.google.com; i want to match "google".. not "www", and not "com".. another example.. if i have ww2.mirror.google.com, i still want it to match only "google". does that help any? strfry()