Amoe has asked for the wisdom of the Perl Monks concerning the following question:

Um, this one is pretty simple. I need a regex to match the ".html" in $myvar = 'http://www.foobar.com/foo/bar/foobar.html';. I have tried, but every attempt matches on the first period in the string and returns the rest, like foobar.com/foo/bar/foobar.html. I read Death to Dot Star!, but had trouble applying Ovid's methods to my own code... Any examples would be very helpful.

Replies are listed 'Best First'.
Re: Regex to match file extension in URL
by demerphq (Chancellor) on Sep 09, 2001 at 19:05 UTC
    Well, it depends on exactly what circumstances you want to deal with. For instance, it is common on my systems to have files with multiple extensions, like so:
    D:\Temp\file.ext.bak
    E:\Code\junk.stuff.ext
    So it depends on whether you want just the absolute LAST extension or all of them (i.e. everything after the first '.'). There are also a few other issues, since you are dealing with URLs... Will there be a parameter string attached to the URL when you want to do this?

    Anyway, enough muttering. The basic way to attack a regex is to think not about what you do want to match but about what you don't want to match. In my experience, when I write a regex based on what I want to match, I end up matching all of the things I intended to as well as a few that I didn't.
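
    To make that concrete for the original question, here is a minimal sketch (the variable names $greedy and $ext are illustrative, not from the thread). It contrasts a dot-star pattern, which starts at the first period, with a negated character class that spells out what the extension must not contain:

    ```perl
    use strict;
    use warnings;

    my $myvar = 'http://www.foobar.com/foo/bar/foobar.html';

    # Dot-star: the match starts at the FIRST '.' and runs to the end.
    my ($greedy) = $myvar =~ /(\..*)/;

    # Negated class: a '.' followed only by characters that are neither
    # '.' nor '/', anchored to the end of the string.
    my ($ext) = $myvar =~ m{(\.[^./]*)$};

    print "$greedy\n";   # .foobar.com/foo/bar/foobar.html
    print "$ext\n";      # .html
    ```

    The second pattern still ignores query strings, which is one of the cases the rest of this reply works through.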

    So let's generalize your question:

    # not all of these are valid urls
    # nor do I want all of them to match
    # For instance some invalids I DO want to match
    # but handle their invalidity later
    my @url=qw(
        http://www.foobar.com
        http://www.foobar.com/foo
        http://www.foobar.com/foo/
        http://www.foobar.com/foo.pl
        http://www.foobar.com/.extension
        http://www.foobar.com?test
        http://www.foobar.com/foo?test
        http://www.foobar.com/foo/?test
        http://www.foobar.com/foo.pl?
        http://www.foobar.com/.extension?
        http://www.foobar.com/foo/bar/foobar.html
        http://www.foobar.com/foo/bar/foo.bar.html
        http://perlmonks.com/index.pl?node_id=68135
        http://perlmonks.com/index.pl??node_id=68135
        http:///file.ext?
        http:///.ext?
        http:///file.ext
        http:///.ext
    );
    Let's say we want to do something more exciting than just match the extension. Let's try to split this into site, path, filename, extension and parameters. So what do we need to match, and not match, for each one? BTW: I'm sure some people might do this differently, and even more likely better, but here's how I would analyze it. Note that, as I said earlier, where the filename ends and the extension begins is not strictly defined. MS Explorer and the like only respect the last extension that is present (try it: set up file associations for '.pm.bak' and '.bak' and see whether it treats a file as '.bak' or as '.pm.bak'). Also, each rule is in the context of the rules before it.

    • Site : after the 'http://', match everything that doesn't have a '/' in it (i.e. up to the first '/')
    • Path : everything that begins with a '/' and ends with a '/', including '/' itself
    • File* :
      • Everything that doesn't include a dot, slash or question mark
      • or everything up to the last dot that doesn't include a slash or a question mark
    • Extension: everything, including dots, that doesn't include a question mark
    • Params: everything from the question mark to the right, including the question mark itself
    Which is loosely what we want to match. I say loosely because I started from this list and it evolved: as I tested various cases I found that I needed to rule out more things at various places, as can be seen from the comments in my code.
    foreach (@url) {
        if (my @parts=m!^http://   #must begin http://
            (                      #capture the site
              [^/?]+               # site has no / or ? in it
            )                      #its mandatory
            (                      #capture the path
              /                    # starts with a /
              (?:                  # group but dont capture
                [^/?]+             # anything but / or ?
                /                  # followed by a /
              )*                   # zero or more times (opt)
            )?                     #all optional
            (                      #capture the filename
              [^./?]               # doesnt start with a . or ? or /
              [^/?]+?              # all chars not / or ? , (ctd.)
                                   # --leave stuff for rest of rex
            )?                     #we dont have to have a filename
            (                      #capture the extension
              \.                   # they start with dots you know
              [^.?]*               # any letter that arent a . or ?
            )?                     #we dont need an extension really
            (                      #capture a parameter string
              \?                   # it starts with a ?
              .*                   # and has any char following
            )?                     #but its optional too..
            $                      #and thats the end folks...
            !x) {                  #ignore comments and whitespace in rex
            print "$_\t".join(',',@parts)."\n"; # weve matched now print
        } else {
            print "NOMATCH:$_\n";  #oops, is this ok?
        }
    } #lets try the next URL and see if we do better....
      # :)
    Which produces something like the following output for the above data:
    URL                                           Site            Path       File     Ext         Params
    http://www.foobar.com                         www.foobar.com  -          -        -           -
    http://www.foobar.com/foo                     www.foobar.com  /          foo      -           -
    http://www.foobar.com/foo/                    www.foobar.com  /foo/      -        -           -
    http://www.foobar.com/foo.pl                  www.foobar.com  /          foo      .pl         -
    http://www.foobar.com/.extension              www.foobar.com  /          -        .extension  -
    http://www.foobar.com?test                    www.foobar.com  -          -        -           ?test
    http://www.foobar.com/foo?test                www.foobar.com  /          foo      -           ?test
    http://www.foobar.com/foo/?test               www.foobar.com  /foo/      -        -           ?test
    http://www.foobar.com/foo.pl?                 www.foobar.com  /          foo      .pl         ?
    http://www.foobar.com/.extension?             www.foobar.com  /          -        .extension  ?
    http://www.foobar.com/foo/bar/foobar.html     www.foobar.com  /foo/bar/  foobar   .html       -
    http://www.foobar.com/foo/bar/foo.bar.html    www.foobar.com  /foo/bar/  foo.bar  .html       -
    http://perlmonks.com/index.pl?node_id=68135   perlmonks.com   /          index    .pl         ?node_id=68135
    http://perlmonks.com/index.pl??node_id=68135  perlmonks.com   /          index    .pl         ??node_id=68135
    http:///file.ext                              NOMATCH
    http:///.ext                                  NOMATCH
    http:///file.ext?                             NOMATCH
    http:///.ext?                                 NOMATCH
    where '-' (dashes) are unmatched parts.

    Ok, ok, so it's not the answer to your exact question.... :-)

    Hopefully, though, there's enough stuff here to help you sort out your problem. Good luck!

    Yves
    --
    You are not ready to use symrefs unless you already know why they are bad. -- tadmc (CLPM)

Re: Regex to match file extension in URL
by enoch (Chaplain) on Sep 09, 2001 at 18:37 UTC
    #!/usr/bin/perl -w
    use strict;

    my $var = 'http://www.foo.bar/baz/some.html';

    # m{...}x is used here so that a '/' in the pattern or in a
    # comment cannot end the regex early
    if ($var =~ m{
            ^              # match the beginning
            [\w.:/]+       # match word chars, '.', '/', or ':'
            \w+\.          # match file name followed by dot
            (\w+(?=$))     # match extension if followed
                           # by the end of line
        }x) {
        my $match = $1;
        print $match . "\n";
    }
    I hope that helps.

    Jeremy
Re: Regex to match file extension in URL
by Jazz (Curate) on Sep 09, 2001 at 23:50 UTC
    This alternative uses File::Basename to extract the filename and the query string. It then uses the extension and parameter capturing portions of demerphq's regex (posted above) to extract the extension and strip the query string, if any.
    #!/usr/bin/perl
    use File::Basename;
    use strict;

    my @files = (
        'http://server.com/subdir/index.html',
        'http://server.com/subdir/dist.tar.gz',
        'http://server.com/whatever.cgi?testing=1',
        'ftp://server.com/pub/whatever.zip',
        'file://local/subdir/testing.txt',
    );

    foreach my $file ( @files ){
        my $suffix = ( fileparse( $file, '\..*$' ) )[2];
        $suffix =~ s/(\.?[^.?]*)?\?.*?$/$1/;
        print $suffix, "\n";
    }

    Note that this code will not handle multi-level extensions, such as .tar.gz. The extension for dist.tar.gz will be reported as .gz (same deal with demerphq's code).

    For extensions of this type, you'll probably need to create an array that's populated with valid file extensions. Coincidentally, you can throw this array at File::Basename to easily ignore invalid extensions. Example:

    my @valid_extensions = qw/ .tar.gz .html .zip /;

    foreach my $file ( @files ){
        my $suffix = ( fileparse( $file, @valid_extensions ) )[2];
        print $suffix, "\n";
    }
    The above code will list a suffix only for the file types noted in @valid_extensions (not the txt or cgi files).
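
    One caveat worth checking: as far as I know, fileparse() treats each suffix as a regular expression pattern, so an unescaped '.' in the list matches any character. A small sketch, assuming that behaviour, which escapes the list with quotemeta first (the @patterns and %suffix_for names are only for illustration):

    ```perl
    use strict;
    use warnings;
    use File::Basename;

    my @valid_extensions = qw/ .tar.gz .html .zip /;

    # fileparse() takes suffix *patterns*; quotemeta makes each '.' literal.
    my @patterns = map { quotemeta } @valid_extensions;

    my @files = (
        'http://server.com/subdir/dist.tar.gz',
        'http://server.com/subdir/index.html',
        'file://local/subdir/testing.txt',
    );

    my %suffix_for;
    foreach my $file (@files) {
        # Element [2] of fileparse()'s return list is the matched suffix,
        # or the empty string when no pattern matched.
        $suffix_for{$file} = ( fileparse( $file, @patterns ) )[2];
        print "$file => '$suffix_for{$file}'\n";
    }
    ```

    URLs that carry a query string would still need the query stripped before this step, since none of the patterns can match at the end of the string otherwise.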

    Jasmine

      Hi Jazz,

      This alternative uses File::Basename to extract the filename and the query string.

      Hey! That's cheating! :-)
      No, just kidding. Actually you are very right. Using File::Basename is much better than using a roll-your-own regex: you are much less likely to find that the rex doesn't work on some strange OS, and more likely to find that some of the weirder cases are properly handled. (For instance, a really robust regex would match BOTH / and \.) OTOH it _is_ a worthy educational process to learn how to do this. Tokenizing filenames with a regex is not a trivial exercise, and IMHO it therefore makes a good learning opportunity.

      The non-trivial nature of tokenizing such a string is illustrated, incidentally, in the post by crazyinsomniac. Now this is a senior monk, with undoubtedly considerable experience, yet clearly he didn't examine too many cases with either his substr/index solution or his regex solution. When I run his solutions against my earlier posted test data I get some perverse results indeed. (The regex and substr versions don't even produce the same results.)

      # selected results of CrazyInsomniac's substr impl.
      # double spacing converted to single by me.
      http://perlmonks.com/index.pl?node_id=68135
              looks like the file name is: index.pl?node_id=68135
                     and the extension is: pl?node_id=68135
         We even got a query string, whoa: node_id=68135
            so the true filename would be: index.pl?
      and the true file extension would b: pl
      http://www.foobar.com/foo/
              looks like the file name is:
                     and the extension is: com/foo/
      http://www.foobar.com/foo?test
              looks like the file name is: foo?test
                     and the extension is: com/foo?test
         We even got a query string, whoa: test
            so the true filename would be: foo?
      and the true file extension would b: com/foo
      http:///file.ext
              looks like the file name is: file.ext
                     and the extension is: ext

      # selected results of CrazyInsomniac's regex implementation
      # input string added by me
      http://www.foobar.com
      (, , )
      (http, www.foobar.com/, foo/bar/foobar.html)
      http://www.foobar.com/foo/bar/foo.bar.html
      http:///file.ext
      (, , )
      Actually, for me there is a moral here: MOST times that I have seen this type of issue attacked with substr() and index(), the result is wrong! There is a notable pain in the ass poster on CLPM (who shall remain nameless, scales and all) who insists on solving every problem she can with substr, index and rindex. Most of these 'solutions' crack under proper test data. On the regex level there is another moral: obvious, intuitive regexes in my experience don't usually work the way one might wish. :-)

      Note that this code will not handle multi-level extensions, such as .tar.gz

      Ahh yes. Originally, as can be seen from the list I provided in my OP, I intended to post two solutions, one along the MS type lines and one along a more natural 'bundled' extension line. However, I got a bit distracted by using CGI to output that table (yes, it took me a while, Amoe, but that's ok, I was using it to learn basic CGI) and completely forgot to post the other solution. :-)

      So in penance I offer two variants of the above regex. One will return all of the extensions bundled together; the other will return ONLY the last two (or fewer) extensions. This second variant could easily be modified for whatever level of bundling is required. I haven't included the full regex; these two snippets should fit in place over my earlier filename part and extension part, leaving the other parts untouched.

      # regex snippet for matching
      # at most two bundled extensions
      # foobar.gzip -> foobar,.gzip
      # foobar.tar.gzip -> foobar,.tar.gzip
      # foo.bar.tar.gzip -> foo.bar,.tar.gzip
      # the snippet should paste into place over
      # my earlier matches for filename and extension
      (                    #capture the filename
        [^./?]             # doesnt start with a . or ? or /
        [^/?]+?            # all chars not / or ? , (ctd.)
                           # --leave stuff for rest of rex
      )?                   #we dont have to have a filename
      (                    #capture the extension
        (?:                # Group but dont capture
          \.               # they start with dots you know
          [^?.]*           # any letter that arent a . or ?
        ){0,2}             # anywhere from 0 to 2 exts please.
      )                    #thanks..

      # regex snippet for matching
      # filename and all bundled extensions
      # foobar.gzip -> foobar,.gzip
      # foobar.tar.gzip -> foobar,.tar.gzip
      # foo.bar.tar.gzip -> foo,.bar.tar.gzip
      # the snippet should paste into place over
      # my earlier matches for filename and extension
      (                    #capture the filename
        [^./?]             # doesnt start with a . or ? or /
        [^/?.]+?           # all chars not / or ? or .
                           # --leave stuff for rest of rex
      )?                   #we dont have to have a filename
      (                    #capture the extension
        \.                 # they start with dots you know
        [^?]*              # any letter that arent a ?
      )?                   #they are optional you know
      Anyway, Jazz, thanks for the analysis. I didn't know that bit about the @valid_extensions in File::Basename.

      Yves
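
      For anyone who wants to try the "at most two" variant without pasting it into the full URL regex, here is a cut-down sketch that applies the same filename and extension groups to bare file names only. The surrounding anchors, the m{...}x delimiters and the %split hash are a simplification for illustration, not part of the regex above:

      ```perl
      use strict;
      use warnings;

      my %split;
      for my $name (qw( foobar.gzip foobar.tar.gzip foo.bar.tar.gzip )) {
          my ($file, $ext) = $name =~ m{
              ^
              ( [^./?] [^/?]+? )?       # filename, matched lazily
              ( (?: \. [^?.]* ){0,2} )  # at most two bundled extensions
              $
          }x;
          $split{$name} = "$file,$ext";
          print "$name -> $split{$name}\n";
      }
      ```

      Changing the {0,2} quantifier is the only edit needed to allow a different number of bundled extensions.
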


Re: Regex to match file extension in URL
by jplindstrom (Monsignor) on Sep 09, 2001 at 16:23 UTC
    How about:

    $myvar =~ /\.html$/;

    The $ will anchor the match to the end of the string.

    If that wasn't what you wanted, could you provide examples of what you want as a result of the match?

    /J

      Sorry if that wasn't clear. What I need is a regex to match the extension of the remote file, whatever it is, not just if it's .html. The $myvar was just an example.
        If that's the case, then how about
        if ($myvar =~ m/\.([^.]+)$/) { print "Matched $1"; }
        Of course, this won't work for URLs with an implicit filename, like "http://www.yahoo.com" or "http://www.somewhere.com/home/". You'll have to catch those bad boys elsewhere in your code.
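
        One way to catch those cases, and a trailing query string as well, is to trim the URL down to its path before applying the same idea. This is only a sketch: url_extension is a made-up helper name, and it assumes the URL has the usual scheme://host/path form.

        ```perl
        use strict;
        use warnings;

        # Hypothetical helper: returns the last extension without the dot,
        # or undef when there is no file name with an extension.
        sub url_extension {
            my ($url) = @_;
            (my $path = $url) =~ s/[?#].*//s;    # drop query string and fragment
            $path =~ s{^\w+://[^/]*}{};          # drop scheme and host
            return $path =~ m{\.([^./]+)$} ? $1 : undef;
        }

        for my $url (
            'http://www.foobar.com/foo/bar/foobar.html',
            'http://perlmonks.com/index.pl?node_id=68135',
            'http://www.yahoo.com',
            'http://www.somewhere.com/home/',
        ) {
            my $ext = url_extension($url);
            print "$url => ", defined $ext ? $ext : '(none)', "\n";
        }
        ```

        Dropping the host first matters: without it, "http://www.yahoo.com" would report "com" as its extension.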

        Gary Blackburn
        Trained Killer

(crazyinsomniac) Re: Regex to match file extension in URL
by crazyinsomniac (Prior) on Sep 10, 2001 at 09:10 UTC
    Since you are only matching a "." and a "/", and you don't particularly care what surrounds them, you don't need a regular expression.

    It's a job for good old-fashioned substr and rindex:

    #!/usr/bin/perl -w
    use strict;

    my $URK = 'http:/://blah.foo.comwhatever/dir/file.extensionelarocko?query=blah&fckj=ekjl';
    my $last_slash = rindex $URK, '/';
    my $last_dot   = rindex $URK, '.';
    my $query      = rindex $URK, '?';

    print "$URK\n\n";

    if( ($last_slash >= 0) and ($last_dot >= 0) ) {
        printf "%35.35s: %s\n\n", "looks like the file name is",
            substr( $URK, $last_slash + 1);
        printf "%35.35s: %s\n\n", "and the extension is",
            substr($URK,$last_dot + 1);
    }
    else {
        print "seems like we got index.something on our hands\n\n";
    }

    if($query >= 0) {
        printf "%35.35s: %s\n\n", "We even got a query string, whoa",
            substr($URK, $query + 1);
        printf "%35.35s: %s\n\n", "so the true filename would be",
            substr ( $URK , $last_slash + 1, $query - $last_slash );
        printf "%35.35s: %s\n\n", "and the true file extension would be",
            substr ( $URK , $last_dot + 1, $query - $last_dot - 1 );
    }

    __END__

    =head1 RESULTS

    http:/://blah.foo.comwhatever/dir/file.extensionelarocko?query=blah&fckj=ekjl

            looks like the file name is: file.extensionelarocko?query=blah&fckj=ekjl

                   and the extension is: extensionelarocko?query=blah&fckj=ekjl

       We even got a query string, whoa: query=blah&fckj=ekjl

          so the true filename would be: file.extensionelarocko?

    and the true file extension would b: extensionelarocko

    =cut
    Looks like what you asked for to me.

    Also, CGI.pm has regexes that will give you all kinds of good stuff from the query string/request url .... you can either use CGI.pm or steal the code from the module depending on your needs(script_name() path_translated() path_info()).

    ## AND IF YOU WANNA CHECK FOR A VALID URL, YOU REALLY NEED A MODULE (URL::URI)
    ## BUT substr and rindex are still the best for the job
    my $url = 'proto://domain.something/dir/file.extension';
    my $protocol = substr $url, 0, index($url, '://'),'';
    ## yada yada yada, you get the point

    However, a regular expression might be "easier" to digest, something along the lines (like "others" have already shown)

    my $url = 'ptoto://foo.combarz.erk/file.ext?query';
    my ($proto, $domain, $filedirquery) =
        $url =~ m|(\w{2,6})://([.a-zA-Z0-9-]+/)(.*?)$|;
    print "($proto, $domain, $filedirquery)\n";
    update: Thu Sep 13 10:26:58 2001 GMT
    demerphq has some valid points. My regex obviously isn't complete, and my "substr & index" solution, which I say is the way to tackle this, isn't "validating" and doesn't handle all the possible cases, but then again, it doesn't look like it does. I recommended using CGI.pm ... For simply getting the file extension from a URL, ignoring the possibility of a query string, and assuming that the filename is in the name.extension format, print substr 'file.htm', 1 + rindex 'file.htm', '.'; cannot be beat.
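
    A small caution on that one-liner: rindex returns -1 when there is no '.', so 1 + rindex becomes 0 and substr hands back the whole string. A minimal guarded sketch (ext_by_rindex is an illustrative name, not something from the thread):

    ```perl
    use strict;
    use warnings;

    # Returns the text after the last '.', or '' when there is no '.'.
    sub ext_by_rindex {
        my ($name) = @_;
        my $dot = rindex($name, '.');
        return $dot < 0 ? '' : substr($name, $dot + 1);
    }

    print ext_by_rindex('file.htm'), "\n";        # htm
    print "[", ext_by_rindex('README'), "]\n";    # []

    # The unguarded form returns the whole name when there is no dot.
    print substr('README', 1 + rindex('README', '.')), "\n";   # README
    ```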

    Anyway, lots of good reading in this thread.

     
    ___crazyinsomniac_______________________________________
    Disclaimer: Don't blame. It came from inside the void

    perl -e "$q=$_;map({chr unpack qq;H*;,$_}split(q;;,q*H*));print;$q/$q;"

Re: Regex to match file extension in URL
by Amoe (Friar) on Sep 10, 2001 at 00:52 UTC
    Many thanks to all who replied to this node. All your answers have been very helpful (demerphq how long did you spend on yours? ++ :)