in reply to Regex to match file extension in URL

Well, it depends on exactly what circumstances you want to deal with. For instance, it is common on my systems to have files with multiple extensions, like so:
D:\Temp\file.ext.bak
E:\Code\junk.stuff.ext
So it depends on whether you want just the absolute LAST extension or all of them (i.e. everything after the first '.'). There are also a few other issues because you are dealing with URLs. For example, will there be a parameter string attached to the URL when you want to do this?
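
To make the "last extension versus all extensions" distinction concrete, here is a minimal sketch. The helper names last_extension and all_extensions are made up for illustration, and the input is assumed to be a bare file name rather than a full path or URL.

```perl
use strict;
use warnings;

# Hypothetical helpers; both expect a bare file name, not a path.
sub last_extension {
    my ($name) = @_;
    # a dot followed by non-dot characters, anchored at the end
    return $name =~ /(\.[^.]+)\z/ ? $1 : undef;
}

sub all_extensions {
    my ($name) = @_;
    # skip the leading run of non-dot characters, keep everything after it
    return $name =~ /\A[^.]+(\..+)\z/s ? $1 : undef;
}

for my $name (qw(file.ext.bak junk.stuff.ext plain)) {
    my $last = last_extension($name);
    my $all  = all_extensions($name);
    printf "%-16s last=%-6s all=%s\n", $name,
        defined $last ? $last : '-',
        defined $all  ? $all  : '-';
}
```

For file.ext.bak the first helper gives '.bak' and the second gives '.ext.bak'; a name with no dot gives undef from both.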

Anyway, enough muttering. The basic way to attack a regex is to think not about what you do want to match but about what you don't want to match. In my experience, when I write a regex based on what I want to match, I end up matching all of the things I intended to as well as a few that I didn't.
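
A small sketch of that idea, under my own assumptions: ext_naive and ext_strict are hypothetical names, and the strict version assumes an http or https URL with a non-empty host. The first pattern only says what is wanted (a dot, then stuff); the second spells out what must not appear in each piece.

```perl
use strict;
use warnings;

# Built from "what I want": a dot followed by something.
sub ext_naive {
    my ($url) = @_;
    return $url =~ /\.(.+)$/ ? $1 : undef;
}

# Built from "what I don't want": the host has no / or ?, the path
# has no ?, and the extension has no dot, slash or question mark
# and must be followed by the query string or the end of the URL.
sub ext_strict {
    my ($url) = @_;
    return $url =~ m{\Ahttps?://[^/?]+/[^?]*\.([^./?]+)(?:\?|\z)}
        ? $1 : undef;
}

for my $url (qw(
    http://www.foobar.com/foo.pl?x=1
    http://www.foobar.com
)) {
    my ($naive, $strict) = (ext_naive($url), ext_strict($url));
    print "$url\n";
    print "  naive:  ", (defined $naive  ? $naive  : '-'), "\n";
    print "  strict: ", (defined $strict ? $strict : '-'), "\n";
}
```

The naive pattern starts at the first dot in the host name and swallows the rest of the URL, which is exactly the "few that I didn't intend" problem.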

So let's generalize your question:

# not all of these are valid urls
# nor do I want all of them to match
# For instance some invalids I DO want to match
# but handle their invalidity later
my @url=qw(
    http://www.foobar.com
    http://www.foobar.com/foo
    http://www.foobar.com/foo/
    http://www.foobar.com/foo.pl
    http://www.foobar.com/.extension
    http://www.foobar.com?test
    http://www.foobar.com/foo?test
    http://www.foobar.com/foo/?test
    http://www.foobar.com/foo.pl?
    http://www.foobar.com/.extension?
    http://www.foobar.com/foo/bar/foobar.html
    http://www.foobar.com/foo/bar/foo.bar.html
    http://perlmonks.com/index.pl?node_id=68135
    http://perlmonks.com/index.pl??node_id=68135
    http:///file.ext?
    http:///.ext?
    http:///file.ext
    http:///.ext
);

Let's say we want to do something more exciting than just match the extension. Let's try to split each URL into site, path, filename, extension and parameters. So what do we need to match, and not match, for each one? BTW: I'm sure some people would do this differently, and quite likely better, but here's how I would analyze it. Note that, as I said earlier, where the filename ends and the extension begins is not strictly defined. MS Explorer and the like only respect the last extension present (try it: create a file association for '.pm.bak' and one for '.bak' and see whether a file is treated as '.bak' or as '.pm.bak'). Also, each rule is in the context of the rules before it.

That is loosely what we want to match. I say loosely because I started from this list and it evolved as I tested various cases and found that I needed to rule out more things at various places, as can be seen from the comments in my code.

foreach (@url) {
    if (my @parts = m!^http://  # must begin http://
        (                # capture the site
          [^/?]+         #   site has no / or ? in it
        )                # it is mandatory
        (                # capture the path
          /              #   starts with a /
          (?:            #   group but don't capture
            [^/?]+       #     anything but / or ?
            /            #     followed by a /
          )*             #   zero or more times (opt)
        )?               # all optional
        (                # capture the filename
          [^./?]         #   doesn't start with a . or ? or /
          [^/?]*?        #   all chars not / or ? , (ctd.)
                         #   --leave stuff for rest of rex
                         #   (*? not +? so one-char names match)
        )?               # we don't have to have a filename
        (                # capture the extension
          \.             #   they start with dots you know
          [^.?]*         #   any chars that aren't a . or ?
        )?               # we don't need an extension really
        (                # capture a parameter string
          \?             #   it starts with a ?
          .*             #   and has any char following
        )?               # but it's optional too..
        $                # and that's the end folks...
    !x) {                # ignore comments and whitespace in rex
        # we've matched, now print; unmatched parts show as '-'
        print "$_\t", join(',', map { defined $_ ? $_ : '-' } @parts), "\n";
    } else {
        print "NOMATCH:$_\n";   # oops, is this ok?
    }
}   # let's try the next URL and see if we do better.... :)

Which produces something like the following output for the above data:
URL                                           Site            Path       File     Ext         Params
http://www.foobar.com                         www.foobar.com  -          -        -           -
http://www.foobar.com/foo                     www.foobar.com  /          foo      -           -
http://www.foobar.com/foo/                    www.foobar.com  /foo/      -        -           -
http://www.foobar.com/foo.pl                  www.foobar.com  /          foo      .pl         -
http://www.foobar.com/.extension              www.foobar.com  /          -        .extension  -
http://www.foobar.com?test                    www.foobar.com  -          -        -           ?test
http://www.foobar.com/foo?test                www.foobar.com  /          foo      -           ?test
http://www.foobar.com/foo/?test               www.foobar.com  /foo/      -        -           ?test
http://www.foobar.com/foo.pl?                 www.foobar.com  /          foo      .pl         ?
http://www.foobar.com/.extension?             www.foobar.com  /          -        .extension  ?
http://www.foobar.com/foo/bar/foobar.html     www.foobar.com  /foo/bar/  foobar   .html       -
http://www.foobar.com/foo/bar/foo.bar.html    www.foobar.com  /foo/bar/  foo.bar  .html       -
http://perlmonks.com/index.pl?node_id=68135   perlmonks.com   /          index    .pl         ?node_id=68135
http://perlmonks.com/index.pl??node_id=68135  perlmonks.com   /          index    .pl         ??node_id=68135
http:///file.ext                              NOMATCH
http:///.ext                                  NOMATCH
http:///file.ext?                             NOMATCH
http:///.ext?                                 NOMATCH
where '-' (dashes) are unmatched parts.
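
If you would rather get the pieces by name than by position, a variation with named captures might look like the sketch below. This is my own assumption of how one could package it, not part of the code above: split_url is a made-up name, the filename uses a lazy *? so one-letter names match, and named captures with %+ need Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical wrapper: returns a hash ref of the parts that matched,
# or undef if the URL does not fit the pattern at all.
sub split_url {
    my ($url) = @_;
    return unless $url =~ m!^http://
        (?<site>   [^/?]+ )               # no / or ? in the site
        (?<path>   / (?: [^/?]+ / )* )?   # leading /, then dir/ dir/ ...
        (?<file>   [^./?] [^/?]*? )?      # lazy, leaves room for the ext
        (?<ext>    \. [^.?]* )?           # last dot onward
        (?<params> \? .* )?               # query string, if any
        $                                 # nothing may be left over
    !x;
    my %part = %+;   # only groups that actually captured appear here
    return \%part;
}

for my $url (qw(
    http://www.foobar.com/foo/bar/foo.bar.html
    http://perlmonks.com/index.pl?node_id=68135
    http:///file.ext
)) {
    my $part = split_url($url);
    if ($part) {
        print "$url\n";
        print "  $_ = $part->{$_}\n" for sort keys %$part;
    } else {
        print "NOMATCH: $url\n";
    }
}
```

Parts that did not match are simply absent from the hash, so exists can be used where the table uses a dash.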

OK, OK, so it's not the answer to your exact question... :-)

Hopefully, though, there's enough here to help you sort out your problem. Good luck!

Yves
--
You are not ready to use symrefs unless you already know why they are bad. -- tadmc (CLPM)