in reply to Regex to match file extension in URL
So it depends on whether you just want the absolute LAST extension or whether you want them all (i.e. everything after the first '.'). For example, D:\Temp\file.ext.bak and E:\Code\junk.stuff.ext each carry two. There are also a few other issues, as you are dealing with URLs... Will there be a parameter string attached to the URL when you want to do this?
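A minimal sketch of the two readings, using the two example paths. The sub names `last_ext` and `all_exts` are made up for illustration, and both assume a bare path with no parameter string on the end.

```perl
use strict;
use warnings;

# Hypothetical helpers, named here only for illustration.

# Last extension only: the final dot and whatever follows it,
# with no further dot or path separator in between.
sub last_ext {
    my ($path) = @_;
    return $path =~ m{(\.[^./\\]*)$} ? $1 : undef;
}

# All extensions: everything from the first dot of the final
# path component onwards.
sub all_exts {
    my ($path) = @_;
    my ($name) = $path =~ m{([^/\\]*)$};    # drop the directories
    return $name =~ m{(\..*)$} ? $1 : undef;
}

for my $path ('D:\Temp\file.ext.bak', 'E:\Code\junk.stuff.ext') {
    printf "%s last=%s all=%s\n", $path, last_ext($path), all_exts($path);
}
```

For the first path this should give `.bak` as the last extension and `.ext.bak` as all of them.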
Anyway, enough muttering. The basic way to attack a regex is to think not about what you do want to match but about what you don't want to match. In my experience, when I write a regex based on what I want to match I end up matching all of the things I intended to, as well as a few that I didn't.
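To make that concrete, here is a small sketch. The sample URL and the variable names are mine, not from the question. The first pattern says what I want ("a dot and then some stuff"); the second says what an extension may not contain.

```perl
use strict;
use warnings;

my $url = 'http://www.foobar.com/foo.pl?node_id=68135';

# "What I want": a dot followed by anything. The greedy .+ starts at
# the first dot in the host name and runs to the end of the string.
my ($greedy) = $url =~ m{\.(.+)$};

# "What I don't want": an extension contains no dot, slash or
# question mark, and is followed by the parameters or the end.
my ($strict) = $url =~ m{\.([^./?]+)(?:\?|\z)};

print "greedy: $greedy\n";
print "strict: $strict\n";
```

Even the second pattern is not a full answer: on `http://www.foobar.com` it would report `com`, which is one reason to anchor on the whole URL rather than just the tail.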
So let's generalize your question
Let's say we want to do something more exciting than just match the extension. Let's try to split the URL into site, path, filename, extension and parameters. So what do we need to match, and not match, for each one? BTW: I'm sure some people would do this differently, and quite likely better, but here's how I would analyze it. Note that, as I said earlier, where the filename ends and the extension begins is not strictly defined. MS Explorer and the like only respect the last extension that is present (try it: set up a file association for '.pm.bak' and one for '.bak' and see whether it treats a file as '.bak' or as '.pm.bak'). Also, each rule is in the context of the rules before it.

    use strict;
    use warnings;

    # not all of these are valid urls
    # nor do I want all of them to match
    # For instance some invalids I DO want to match
    # but handle their invalidity later
    my @url=qw(
        http://www.foobar.com
        http://www.foobar.com/foo
        http://www.foobar.com/foo/
        http://www.foobar.com/foo.pl
        http://www.foobar.com/.extension
        http://www.foobar.com?test
        http://www.foobar.com/foo?test
        http://www.foobar.com/foo/?test
        http://www.foobar.com/foo.pl?
        http://www.foobar.com/.extension?
        http://www.foobar.com/foo/bar/foobar.html
        http://www.foobar.com/foo/bar/foo.bar.html
        http://perlmonks.com/index.pl?node_id=68135
        http://perlmonks.com/index.pl??node_id=68135
        http:///file.ext?
        http:///.ext?
        http:///file.ext
        http:///.ext
    );

    foreach (@url) {
        if (my @parts=m!^http://   #must begin http://
            (                      #capture the site
              [^/?]+               # site has no / or ? in it
            )                      #its mandatory
            (                      #capture the path
              /                    # starts with a /
              (?:                  # group but dont capture
                [^/?]+             # anything but / or ?
                /                  # followed by a /
              )*                   # zero or more times (opt)
            )?                     #all optional
            (                      #capture the filename
              [^./?]               # doesnt start with a . or ? or /
              [^/?]+?              # all chars not / or ? , (ctd.)
                                   # --leave stuff for rest of rex
            )?                     #we dont have to have a filename
            (                      #capture the extension
              \.                   # they start with dots you know
              [^.?]*               # any chars that arent a . or ?
            )?                     #we dont need an extension really
            (                      #capture a parameter string
              \?                   # it starts with a ?
              .*                   # and has any char following
            )?                     #but its optional too..
            $                      #and thats the end folks...
            !x) {                  #ignore comments and whitespace in rex
            # weve matched now print, showing unmatched optional parts as '-'
            print "$_\t".join(',', map { defined $_ ? $_ : '-' } @parts)."\n";
        } else {
            print "NOMATCH:$_\n";  #oops, is this ok?
        }
    }   #lets try the next URL and see if we do better....
        # :)

Which produces something like the following output for the above data:

| URL | Site | Path | File | Ext | Params |
|---|---|---|---|---|---|
| http://www.foobar.com | www.foobar.com | - | - | - | - |
| http://www.foobar.com/foo | www.foobar.com | / | foo | - | - |
| http://www.foobar.com/foo/ | www.foobar.com | /foo/ | - | - | - |
| http://www.foobar.com/foo.pl | www.foobar.com | / | foo | .pl | - |
| http://www.foobar.com/.extension | www.foobar.com | / | - | .extension | - |
| http://www.foobar.com?test | www.foobar.com | - | - | - | ?test |
| http://www.foobar.com/foo?test | www.foobar.com | / | foo | - | ?test |
| http://www.foobar.com/foo/?test | www.foobar.com | /foo/ | - | - | ?test |
| http://www.foobar.com/foo.pl? | www.foobar.com | / | foo | .pl | ? |
| http://www.foobar.com/.extension? | www.foobar.com | / | - | .extension | ? |
| http://www.foobar.com/foo/bar/foobar.html | www.foobar.com | /foo/bar/ | foobar | .html | - |
| http://www.foobar.com/foo/bar/foo.bar.html | www.foobar.com | /foo/bar/ | foo.bar | .html | - |
| http://perlmonks.com/index.pl?node_id=68135 | perlmonks.com | / | index | .pl | ?node_id=68135 |
| http://perlmonks.com/index.pl??node_id=68135 | perlmonks.com | / | index | .pl | ??node_id=68135 |
| http:///file.ext | NOMATCH | | | | |
| http:///.ext | NOMATCH | | | | |
| http:///file.ext? | NOMATCH | | | | |
| http:///.ext? | NOMATCH | | | | |
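
If you'd rather get the pieces back by name than by position, one way is to compile the same pattern once with `qr//` and slice the captures into a hash. This is only a sketch: the sub name `split_url` and the hash keys are my own invention, and unmatched optional parts come back as undef.

```perl
use strict;
use warnings;

# The same pattern, compiled once. Comments inside it must avoid the
# delimiter and anything that looks like a variable, because both are
# processed before /x gets to strip the comments.
my $url_rx = qr!^http://
    ([^/?]+)                # site
    (/(?:[^/?]+/)*)?        # path
    ([^./?][^/?]+?)?        # filename
    (\.[^.?]*)?             # extension
    (\?.*)?                 # parameters
    $                       # end of string
!x;

# Hypothetical helper: returns a hash ref of the parts, or undef
# when the URL does not match at all.
sub split_url {
    my ($url) = @_;
    my %part;
    @part{qw(site path file ext params)} = $url =~ $url_rx
        or return;
    return \%part;
}

my $p = split_url('http://perlmonks.com/index.pl?node_id=68135');
print "$p->{file}$p->{ext}\n" if $p;
```

With the perlmonks URL this should print `index.pl`, and a URL with an empty site such as `http:///file.ext` should come back as undef.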
Ok, Ok, so it's not the answer to your exact question.... :-)
Hopefully, though, there's enough stuff here to help you sort out your problem. Good luck!
Yves
--
You are not ready to use symrefs unless you already know why they are bad. -- tadmc (CLPM)