Amoe has asked for the wisdom of the Perl Monks concerning the following question:
Replies are listed 'Best First'.

Re: Regex to match file extension in URL
by demerphq (Chancellor) on Sep 09, 2001 at 19:05 UTC
So it depends on whether you just want the absolute LAST extension or you want them all (i.e. everything after the first '.'). There are also a few other issues, as you are dealing with URLs... Will there be a parameter string attached to the URL when you want to do this?

Anyway, enough muttering. The basic way to attack a regex is to think not about what you do want to match but about what you don't want to match. In my experience, when I write a regex based on what I want to match, I end up matching all of the things I intended to, as well as a few that I didn't.

So let's generalize your question. Let's say we want to do something more exciting than just match the extension. Let's try to split this into site, path, filename, extension, parameters. So what do we need to match, and to not match, for each one?

BTW: I'm sure some people might do this differently, and even more likely better, but here's how I would analyze it. Note that, as I said earlier, where the filename ends and the extension begins is not strictly defined. MS Explorer and the like only respect the last extension that is present (try it: do a file association with 'pm.bak' and '.bak' and see whether it treats a file as '.bak' or as '.pm.bak'). Also, each rule is in the context of the rules before it.
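As a rough sketch of that way of thinking (not the regex from this post), each piece below is described by the characters it must not contain, and each rule assumes the earlier pieces have already been consumed. The helper name `split_url`, the hash keys and the sample URL are made up for illustration, and fragments (`#...`) are deliberately not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: split a URL into site, path, filename,
# extension and parameters. Returns an empty list if nothing matches.
sub split_url {
    my ($url) = @_;

    my ($site, $path, $file, $params) = $url =~ m{
        ^ (?: \w+ :// )?          # scheme, skipped
        ( [^/?]+ )                # site: no slash, no question mark
        ( (?: / [^/?]* )* / )?    # path: through the last slash
        ( [^/?]* )                # file: no slash, no question mark
        (?: \? ( .* ) )?          # parameters: the rest, after the mark
        \z
    }x or return;

    # MS-style rule: only the last dot starts the extension.
    my ($name, $ext) = $file =~ /^(.*?)(?:\.([^.]*))?\z/s;

    return (
        site      => $site,
        path      => $path   // '',
        filename  => $name,
        extension => $ext    // '',
        params    => $params // '',
    );
}

my %part = split_url('http://www.example.com/a/b/dist.tar.gz?x=1&y=2');
print "$_ = $part{$_}\n" for qw(site path filename extension params);
```

Under these rules `dist.tar.gz` splits into filename `dist.tar` and extension `gz`, and a URL with an implicit filename such as `http://www.somewhere.com/home/` comes back with an empty filename and extension.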
Which produces something like the following output for the above data:
OK, OK, so it's not the answer to your exact question... :-) Hopefully, though, there's enough stuff here to help you sort out your problem. Good luck!
Yves

Re: Regex to match file extension in URL
by enoch (Chaplain) on Sep 09, 2001 at 18:37 UTC
I hope that helps.
Jeremy

Re: Regex to match file extension in URL
by Jazz (Curate) on Sep 09, 2001 at 23:50 UTC
Note that this code will not handle multi-level extensions, such as .tar.gz. The extension for dist.tar.gz will be reported as .gz (same deal with demerphq's code). For extensions of this type, you'll probably need to create an array that's populated with valid file extensions. Coincidentally, you can throw this array at File::Basename to easily ignore invalid extensions.

Example:

The above code will list a suffix only for the file types noted in @valid_extensions (not the txt or cgi files).

Jasmine
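
A minimal sketch of the @valid_extensions idea from Jazz's reply, not necessarily the code originally posted: `fileparse` from File::Basename accepts a list of suffix patterns and reports a suffix only when one of them matches the end of the filename. The contents of `@valid_extensions` and the sample paths are made up for illustration. The entries are regexes, so the dots are escaped, and the longer `.tar.gz` pattern is listed before `.gz` so that it gets the first chance to match.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Illustrative whitelist; fileparse treats each entry as a regex
# that is matched against the end of the filename.
my @valid_extensions = (qr/\.tar\.gz/, qr/\.gz/, qr/\.html?/);

for my $path (qw(/pub/dist.tar.gz /docs/index.html /docs/notes.txt /cgi-bin/search.cgi)) {
    my ($name, $dir, $suffix) = fileparse($path, @valid_extensions);
    printf "%-20s name=%-10s suffix=%s\n", $path, $name, $suffix;
}
```

For the txt and cgi paths the suffix comes back empty and the name keeps its extension, which is how the unlisted types get ignored.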
by demerphq (Chancellor) on Sep 10, 2001 at 14:14 UTC
"This alternative uses File::Basename to extract the filename and the query string."

Hey! That's cheating! :-) The non-trivial nature of tokenizing such a string is illustrated incidentally in the post by crazyinsomniac. Now this is a senior monk, with undoubtedly considerable experience, yet clearly he didn't examine too many cases with either his substr/index solution or his regex solution. When I run his solutions against my earlier posted test data I get some perverse results indeed. (The regex and substr versions don't even produce the same results.)
Actually, for me there is a moral here: MOST times that I have seen this type of issue attacked with substr() and index(), the result is wrong! There is a notable pain-in-the-ass poster on CLPM (who shall remain nameless, scales and all) who insists on solving every problem she can with substr, index and rindex. Most of these 'solutions' crack under proper test data. On the regex level there is another moral: obvious, intuitive regexes, in my experience, don't usually work the way one might wish. :-)

"Note that this code will not handle multi-level extensions, such as .tar.gz"

Ahh yes. Originally, as can be seen from the list I provided in my OP, I intended to post two solutions, one along the MS-type lines and one along a more natural 'bundled' extension line. However, I got a bit distracted by using CGI to output that table (yes, it took me a while, Amoe, but that's OK, I was using it to learn basic CGI) and completely forgot to post the other solution. :-)

So in penance I offer two variants of the above regex. One will return all of the extensions bundled together; the other will return ONLY the last two (or fewer) extensions. This second variant could easily be modified for whatever level of bundling is required. I haven't included the full regex; these two snippets should fit in place over my earlier filename part and extension part, leaving the other parts untouched.
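
A minimal sketch of what two such variants could look like, assuming a bare filename with the path and parameters already stripped off. These are not the snippets originally posted, and the helper names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helpers; both take a bare filename (no path, no query)
# and return (name, extension) when called in list context.

# Variant 1: bundle every extension, i.e. split at the FIRST dot.
sub split_all_ext {
    my ($file) = @_;
    return $file =~ /^([^.]*)((?:\.[^.]*)*)\z/;
}

# Variant 2: keep only the last two (or fewer) extensions.
sub split_last_two_ext {
    my ($file) = @_;
    return $file =~ /^(.*?)((?:\.[^.]*){0,2})\z/s;
}

for my $file (qw(dist.tar.gz backup.pm.tar.gz index.html README)) {
    my ($n1, $e1) = split_all_ext($file);
    my ($n2, $e2) = split_last_two_ext($file);
    printf "%-18s all=(%s, %s) last-two=(%s, %s)\n", $file, $n1, $e1, $n2, $e2;
}
```

Changing the `{0,2}` quantifier in the second variant is the knob for a different level of bundling.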
Anyway, Jazz, thanks for the analysis. I didn't know that bit about the @valid_extensions in File::Basename.

Yves

Re: Regex to match file extension in URL
by jplindstrom (Monsignor) on Sep 09, 2001 at 16:23 UTC
$myvar =~ /\.html$/;
The $ will anchor the match to the end of the string. If that wasn't what you wanted, could you provide examples of what you want as a result of the match?
/J
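
To illustrate what the `$` anchor in jplindstrom's match does and does not catch, here is a small sketch. The helper name and the sample URLs are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string ends in ".html".
sub ends_in_html {
    my ($str) = @_;
    return $str =~ /\.html$/ ? 1 : 0;
}

for my $url ('http://www.example.com/index.html',
             'http://www.example.com/index.html?page=2',
             'http://www.example.com/index.html.bak') {
    printf "%-45s %s\n", $url, ends_in_html($url) ? 'match' : 'no match';
}
```

Only the first URL matches: a trailing parameter string or a further extension moves `.html` away from the end of the string.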
by Amoe (Friar) on Sep 09, 2001 at 16:26 UTC
by Trimbach (Curate) on Sep 09, 2001 at 16:44 UTC
Of course, this won't work for URLs with an implicit filename, like "http://www.yahoo.com" or "http://www.somewhere.com/home/". You'll have to catch those bad boys elsewhere in your code.
Gary Blackburn
by Amoe (Friar) on Sep 09, 2001 at 17:15 UTC
by Trimbach (Curate) on Sep 09, 2001 at 19:07 UTC

(crazyinsomniac) Re: Regex to match file extension in URL
by crazyinsomniac (Prior) on Sep 10, 2001 at 09:10 UTC
Looks like what you asked for to me. Also, CGI.pm has regexes that will give you all kinds of good stuff from the query string/request URL: you can either use CGI.pm or steal the code from the module, depending on your needs (script_name(), path_translated(), path_info()).

However, a regular expression might be "easier" to digest, something along these lines (like "others" have already shown).

Update (Thu Sep 13 10:26:58 2001 GMT): demerphq has some valid points. My regex obviously isn't complete, and my "substr & index" solution, which I say is the way to tackle this, isn't "validating" and doesn't handle all the possible cases, but then again, it doesn't look like it does. I recommended using CGI.pm ... For simply getting the file extension from a URL, ignoring the possibility of a query string, and assuming that the filename is in the name.extension format, print substr 'file.htm', 1 + rindex 'file.htm', '.'; cannot be beat.

Anyway, lots of good reading in this thread.
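
The substr/rindex one-liner in crazyinsomniac's update relies on the name.extension assumption: when there is no dot, `rindex` returns -1, so `1 + rindex ...` becomes 0 and `substr` hands back the whole string. A minimal sketch with that case guarded follows. The helper name is hypothetical, and it still assumes a bare filename with no path or query string.

```perl
use strict;
use warnings;

# Hypothetical helper: extension of a bare filename via rindex/substr.
# Assumes no path and no query string, as in the one-liner.
sub ext_by_rindex {
    my ($file) = @_;
    my $dot = rindex($file, '.');
    return if $dot < 0;                # no dot, so no extension
    return substr($file, $dot + 1);    # everything after the last dot
}

for my $file (qw(file.htm dist.tar.gz README)) {
    my $ext = ext_by_rindex($file);
    printf "%-12s %s\n", $file, defined $ext ? $ext : '(none)';
}
```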
Re: Regex to match file extension in URL
by Amoe (Friar) on Sep 10, 2001 at 00:52 UTC