Re: Regex to match file extension in URL

Well, it depends on exactly what circumstances you want to deal with. For instance, it is common on my systems to have files with multiple extensions, like so:

D:\Temp\file.ext.bak
E:\Code\junk.stuff.ext

So it depends on whether you just want the absolute LAST extension or want them all (i.e. everything after the first dot). There are also a few other issues, since you are dealing with URLs. Will there be a parameter string attached to the URL when you want to do this?
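To make the two choices concrete, here is a minimal sketch. It assumes a bare file name with no directory part, and the helper names last_extension and all_extensions are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helpers, assuming a bare file name (no directories).

# Return only the final extension, e.g. ".bak" for "file.ext.bak".
sub last_extension {
    my ($name) = @_;
    return $name =~ /(\.[^.]*)\z/ ? $1 : '';
}

# Return everything from the first dot onward, e.g. ".ext.bak".
sub all_extensions {
    my ($name) = @_;
    return $name =~ /(\..*)\z/s ? $1 : '';
}

for my $name ('file.ext.bak', 'junk.stuff.ext', 'README') {
    printf "%-16s last=%-6s all=%s\n",
        $name, last_extension($name), all_extensions($name);
}
```

Both return an empty string when the name has no dot at all, which is one more case you have to decide how to treat.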

Anyway, enough muttering. The basic way to attack a regex is to think not about what you do want to match but about what you don't want to match. In my experience, when I write a regex based on what I want to match, I end up matching all of the things I intended to, as well as a few that I didn't.
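A small sketch of the difference, pulling the site out of a URL both ways. The sub names and the sample URL are only for illustration:

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration; the names are made up.

# Written from "what I want": any characters after the scheme.
# The greedy .+ also swallows the path and the parameter string.
sub site_by_wanting {
    my ($url) = @_;
    return $url =~ m{^http://(.+)} ? $1 : undef;
}

# Written from "what I don't want": stop at the first / or ?.
sub site_by_excluding {
    my ($url) = @_;
    return $url =~ m{^http://([^/?]+)} ? $1 : undef;
}

my $url = 'http://www.foobar.com/foo/bar.html?x=1';
print "wanting:   ", site_by_wanting($url),   "\n";
print "excluding: ", site_by_excluding($url), "\n";
```

The first version hands back the whole rest of the URL, the second just www.foobar.com, because the negated character class says up front which characters may never be part of a site.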

So let's generalize your question:

           # not all of these are valid urls
           # nor do I want all of them to match
           # For instance some invalids I DO want to match
           # but handle their invalidity later
my @url=qw(
           http://www.foobar.com
           http://www.foobar.com/foo
           http://www.foobar.com/foo/
           http://www.foobar.com/foo.pl
           http://www.foobar.com/.extension
           http://www.foobar.com?test
           http://www.foobar.com/foo?test
           http://www.foobar.com/foo/?test
           http://www.foobar.com/foo.pl?
           http://www.foobar.com/.extension?
           http://www.foobar.com/foo/bar/foobar.html
           http://www.foobar.com/foo/bar/foo.bar.html
           http://perlmonks.com/index.pl?node_id=68135
           http://perlmonks.com/index.pl??node_id=68135
           http:///file.ext?
           http:///.ext?
           http:///file.ext
           http:///.ext
           );

Let's say we want to do something more exciting than just match the extension. Let's try to split this into site, path, filename, extension and parameters. So what do we need to match, and not match, for each one? BTW: I'm sure some people would do this differently, and quite likely better, but here's how I would analyze it. Note that, as I said earlier, where the filename ends and the extension begins is not strictly defined. MS Explorer and the like only respect the last extension present (try it: set up file associations for '.pm.bak' and '.bak' and see whether it treats a file as '.bak' or as '.pm.bak'). Also, each rule applies in the context of the rules before it.

Site : Match after the 'http://' everything that doesn't have a '/' in it (that is, up to the first '/')
Path : everything that begins with '/' and ends with '/', including '/' itself
File* :
- Everything that doesn't include a dot, slash or question mark
- or everything up to the last dot that doesn't include a slash or a question mark
Extension: everything from the last dot onward, dot included, that doesn't include a question mark
Params: everything from the question mark to the end, including the question mark itself

Which is loosely what we want to match. I say loosely because I started from this list and it evolved: as I tested various cases I found I needed to rule out more things at various places, as can be seen from the comments in my code.

 
foreach (@url) {
    if (my @parts=m!^http://   #must begin http://
                  (            #capture the site
                     [^/?]+    #  site has no / or ? in it
                  )            #its mandatory

                  (            #capture the path
                     /         #  starts with a /
                     (?:       #  group but dont capture
                        [^/?]+ #    anything but / or ?
                        /      #    followed by a /
                     )*        #  zero or more times (opt)
                  )?           #all optional

                  (            #capture the filename
                      [^./?]   #  doesnt start with a . or ? or /
                      [^/?]*?  #  then any chars not / or ? , so that
                               #    one-letter names work (ctd.)
                               #    --leave stuff for rest of rex
                   )?          #we dont have to have a filename

                  (            #capture the extension
                      \.       #  they start with dots you know
                      [^./?]*  #  any chars that arent a . or ? or /
                   )?          #we dont need an extension really

                  (            #capture a parameter string
                      \?       #  it starts with a ?
                      .*       #  and has any char following
                  )?           #but its optional too..
                  $            #and thats the end folks...
                !x) {          #ignore comments and whitespace in rex
        # weve matched, now print, showing '-' for unmatched parts
        print "$_\t".join(',', map { defined $_ ? $_ : '-' } @parts)."\n";
    } else {
        print "NOMATCH:$_\n"; #oops, is this ok?
    }
} #lets try the next URL and see if we do better....
# :)

Which produces something like the following output for the above data:

URL	Site	Path	File	Ext	Params
http://www.foobar.com	www.foobar.com	-	-	-	-
http://www.foobar.com/foo	www.foobar.com	/	foo	-	-
http://www.foobar.com/foo/	www.foobar.com	/foo/	-	-	-
http://www.foobar.com/foo.pl	www.foobar.com	/	foo	.pl	-
http://www.foobar.com/.extension	www.foobar.com	/	-	.extension	-
http://www.foobar.com?test	www.foobar.com	-	-	-	?test
http://www.foobar.com/foo?test	www.foobar.com	/	foo	-	?test
http://www.foobar.com/foo/?test	www.foobar.com	/foo/	-	-	?test
http://www.foobar.com/foo.pl?	www.foobar.com	/	foo	.pl	?
http://www.foobar.com/.extension?	www.foobar.com	/	-	.extension	?
http://www.foobar.com/foo/bar/foobar.html	www.foobar.com	/foo/bar/	foobar	.html	-
http://www.foobar.com/foo/bar/foo.bar.html	www.foobar.com	/foo/bar/	foo.bar	.html	-
http://perlmonks.com/index.pl?node_id=68135	perlmonks.com	/	index	.pl	?node_id=68135
http://perlmonks.com/index.pl??node_id=68135	perlmonks.com	/	index	.pl	??node_id=68135
http:///file.ext	NOMATCH
http:///.ext	NOMATCH
http:///file.ext?	NOMATCH
http:///.ext?	NOMATCH

where '-' (dashes) are unmatched parts.
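
If you want to reuse the split elsewhere, it can be wrapped in a sub that hands back the parts by name. This is only a sketch: it assumes Perl 5.10 or later (for named captures and //), the sub name split_url and the hash keys are made up, and the pattern lets a filename be a single character and keeps '/' out of the extension.

```perl
use strict;
use warnings;

# Hypothetical helper: split an http URL into the five parts discussed
# above. Returns a hash ref with '-' for any part that did not match,
# or nothing at all when the URL does not fit the pattern.
sub split_url {
    my ($url) = @_;
    return unless $url =~ m{
        ^http://
        (?<site>   [^/?]+ )               # no / or ? in the site
        (?<path>   / (?: [^/?]+ / )* )?   # /dir/dir/ with trailing slash
        (?<file>   [^./?] [^/?]*? )?      # lazy, leaves the last .ext
        (?<ext>    \. [^./?]* )?          # last dot onward
        (?<params> \? .* )?               # the ? and all that follows
        $
    }x;
    my %part;
    $part{$_} = $+{$_} // '-' for qw(site path file ext params);
    return \%part;
}

for my $url ('http://www.foobar.com/foo/bar/foo.bar.html',
             'http://perlmonks.com/index.pl?node_id=68135',
             'http:///file.ext') {
    my $p = split_url($url);
    print "$url\t",
        ($p ? join("\t", @{$p}{qw(site path file ext params)}) : 'NOMATCH'),
        "\n";
}
```

Named captures mean the caller does not have to remember which position in the list is the extension.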

Ok, Ok, so it's not the answer to your exact question.... :-)

Hopefully, though, there's enough stuff here to help you sort out your problem. Good luck!

Yves
--
You are not ready to use symrefs unless you already know why they are bad. -- tadmc (CLPM)
