Regex to match file extension in URL

Amoe has asked for the wisdom of the Perl Monks concerning the following question:


Re: Regex to match file extension in URL
by demerphq (Chancellor) on Sep 09, 2001 at 19:05 UTC

D:\Temp\file.ext.bak
E:\Code\junk.stuff.ext

Anyway, enough muttering. The basic way to attack a regex is to think not about what you do want to match but about what you don't want to match. In my experience, when I write a regex based on what I want to match, I end up matching all of the things I intended to as well as a few that I didn't.
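
To make that concrete, here is a minimal sketch comparing a "what I want" character class with a "what I don't want" class for the site part of a URL. The sample URL and the variable names are made up for illustration only.

```perl
use strict;
use warnings;

# Hypothetical sample input, chosen only for illustration.
my $url = 'http://www.foo-bar.com:8080/index.pl?node_id=1';

# "What I want": word characters and dots. Stops early at the hyphen.
my ($positive) = $url =~ m{^http://([\w.]+)};

# "What I don't want": anything up to the first / or ?.
my ($negated) = $url =~ m{^http://([^/?]+)};

print "positive: $positive\n";   # www.foo
print "negated:  $negated\n";    # www.foo-bar.com:8080
```

The positive class silently truncates the host at the first character nobody thought to list, while the negated class only stops where the next part of the URL begins.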

So let's generalize your question

           # not all of these are valid urls
           # nor do I want all of them to match
           # For instance some invalids I DO want to match
           # but handle their invalidity later
my @url=qw(
           http://www.foobar.com
           http://www.foobar.com/foo
           http://www.foobar.com/foo/
           http://www.foobar.com/foo.pl
           http://www.foobar.com/.extension
           http://www.foobar.com?test
           http://www.foobar.com/foo?test
           http://www.foobar.com/foo/?test
           http://www.foobar.com/foo.pl?
           http://www.foobar.com/.extension?
           http://www.foobar.com/foo/bar/foobar.html
           http://www.foobar.com/foo/bar/foo.bar.html
           http://perlmonks.com/index.pl?node_id=68135
           http://perlmonks.com/index.pl??node_id=68135
           http:///file.ext?
           http:///.ext?
           http:///file.ext
           http:///.ext
           );

last

Site : after the 'http://', match everything that doesn't have a '/' or '?' in it (that is, up to the first '/' or '?')
Path : everything that begins with a '/' and ends with a '/', including a lone '/'
File* :
- everything that doesn't include a dot, slash or question mark
- or everything up to the last dot that doesn't include a slash or a question mark
Extension: everything from the last dot on (dot included) that doesn't include a question mark
Params: everything from the question mark (inclusive) to the end of the string

 
foreach (@url) {
    if (my @parts=m!^http://   #must begin http://
                  (            #capture the site
                     [^/?]+    #  site has no / or ? in it
                  )            #its mandatory

                  (            #capture the path
                     /         #  starts with a /
                     (?:       #  group but dont capture
                        [^/?]+ #    anything but / or ?
                        /      #    followed by a /
                     )*        #  zero or more times (opt)
                  )?           #all optional

                  (            #capture the filename
                      [^./?]   #  doesnt start with a . or ? or /
                      [^/?]+?  #  all chars not / or ? , (ctd.)
                               #    --leave stuff for rest of rex
                   )?          #we dont have to have a filename

                  (            #capture the extension
                      \.       #  they start with dots you know
                      [^.?]*   #  any letter that arent a . or ?
                   )?          #we dont need an extension really

                  (            #capture a parameter string
                      \?       #  it starts with a ?
                      .*       #  and has any char following
                  )?           #but its optional too..
                  $            #and thats the end folks...
                !x) {          #ignore comments and whitespace in rex
        print "$_\t".join(',',@parts)."\n"; # weve matched now print
    } else {
        print "NOMATCH:$_\n"; #oops, is this ok?
    }
} #lets try the next URL and see if we do better....
# :)

URL	Site	Path	File	Ext	Params
http://www.foobar.com	www.foobar.com	-	-	-	-
http://www.foobar.com/foo	www.foobar.com	/	foo	-	-
http://www.foobar.com/foo/	www.foobar.com	/foo/	-	-	-
http://www.foobar.com/foo.pl	www.foobar.com	/	foo	.pl	-
http://www.foobar.com/.extension	www.foobar.com	/	-	.extension	-
http://www.foobar.com?test	www.foobar.com	-	-	-	?test
http://www.foobar.com/foo?test	www.foobar.com	/	foo	-	?test
http://www.foobar.com/foo/?test	www.foobar.com	/foo/	-	-	?test
http://www.foobar.com/foo.pl?	www.foobar.com	/	foo	.pl	?
http://www.foobar.com/.extension?	www.foobar.com	/	-	.extension	?
http://www.foobar.com/foo/bar/foobar.html	www.foobar.com	/foo/bar/	foobar	.html	-
http://www.foobar.com/foo/bar/foo.bar.html	www.foobar.com	/foo/bar/	foo.bar	.html	-
http://perlmonks.com/index.pl?node_id=68135	perlmonks.com	/	index	.pl	?node_id=68135
http://perlmonks.com/index.pl??node_id=68135	perlmonks.com	/	index	.pl	??node_id=68135
http:///file.ext	NOMATCH
http:///.ext	NOMATCH
http:///file.ext?	NOMATCH
http:///.ext?	NOMATCH
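
If you would rather call this from other code than print a table, the same pattern can be wrapped in a sub. This is only a sketch: the name split_url and the hash keys are invented here for illustration, and it uses m{}x rather than m!!x purely so the slashes read more easily.

```perl
use strict;
use warnings;

# Split a URL into the five parts used in the table above.
# Returns an empty list when the URL does not match.
sub split_url {
    my ($url) = @_;
    my @parts = $url =~ m{
        ^http://
        ( [^/?]+ )                # site: no / or ? in it
        ( / (?: [^/?]+ / )* )?    # path: starts and ends with a /
        ( [^./?] [^/?]+? )?       # file: lazy, leaves the rest for the extension
        ( \. [^.?]* )?            # extension: from the last dot on
        ( \? .* )?                # params: from the ? to the end
        $                         # nothing may be left over
    }x or return;

    my %part;
    @part{qw(site path file ext params)} = @parts;
    return %part;
}

my %p = split_url('http://perlmonks.com/index.pl?node_id=68135');
print join(',', map { defined $p{$_} ? $p{$_} : '-' }
                qw(site path file ext params)), "\n";
# perlmonks.com,/,index,.pl,?node_id=68135
```

Parts that were not present come back as undef, which is why the print line substitutes a '-' for them, as in the table.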

Ok, Ok, so it's not the answer to your exact question.... :-)

Hopefully though there's enough stuff here to help you sort out your problem. Good luck!

Yves
--
You are not ready to use symrefs unless you already know why they are bad. -- tadmc (CLPM)


Re: Regex to match file extension in URL
by enoch (Chaplain) on Sep 09, 2001 at 18:37 UTC

#!/usr/bin/perl -w
use strict;

my $var = 'http://www.foo.bar/baz/some.html';

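# m{}x is used so that the '/' in the pattern and in the comments
# cannot be mistaken for the closing delimiter
my ($match) = $var =~ m{
            ^             # match the beginning
            [\w.:/]+      # match word chars, '.', '/', or ':'
            \w+\.         # match file name followed by dot
            (\w+)         # capture the extension
            $             # which must end the string
        }x;

print defined $match ? "$match\n" : "no extension found\n";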

Re: Regex to match file extension in URL
by Jazz (Curate) on Sep 09, 2001 at 23:50 UTC

This alternative uses File::Basename to extract the filename and the query string, instead of a regex like demerphq's (posted above).

#!/usr/bin/perl

use File::Basename;
use strict;

my @files = (
    'http://server.com/subdir/index.html',
    'http://server.com/subdir/dist.tar.gz',
    'http://server.com/whatever.cgi?testing=1',
    'ftp://server.com/pub/whatever.zip',
    'file://local/subdir/testing.txt',
);


foreach my $file ( @files ){

    my $suffix = ( fileparse( $file, '\..*$' ) )[2];
    $suffix =~ s/(\.?[^.?]*)?\?.*?$/$1/;

    print $suffix, "\n";
}

Note that this code will not handle multi-level extensions, such as .tar.gz. The extension for dist.tar.gz will be reported as .gz (same deal with demerphq's code).

For extensions of this type, you'll probably need to create an array that's populated with valid file extensions. Conveniently, you can throw this array at File::Basename to easily ignore invalid extensions. Example:

# fileparse() treats suffixes as regex patterns, so escape the dots
my @valid_extensions = map { quotemeta } qw/ .tar.gz .html .zip /;

foreach my $file ( @files ){

    my $suffix = ( fileparse( $file, @valid_extensions ) )[2];

    print $suffix, "\n";
}
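
A variation on the same idea, offered only as a sketch: strip the query string before handing the URL to fileparse, so the suffix list never has to deal with it. The helper name url_suffix is hypothetical. fileparse matches each suffix as a pattern against the end of the name, which is why the dots are escaped with quotemeta.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Hypothetical helper: returns the suffix of a URL's file name, or ''
# when none of the known extensions match.
sub url_suffix {
    my ($url, @extensions) = @_;
    (my $path = $url) =~ s/\?.*\z//s;              # drop the query string first
    my @patterns = map { quotemeta } @extensions;  # match the dots literally
    return ( fileparse( $path, @patterns ) )[2];
}

my @known = qw/ .tar.gz .html .zip .cgi /;
print url_suffix('http://server.com/whatever.cgi?testing=1', @known), "\n";  # .cgi
print url_suffix('http://server.com/subdir/dist.tar.gz', @known), "\n";      # .tar.gz
```

Because fileparse tries the suffixes in the order given, longer bundled extensions such as .tar.gz are best listed before their shorter tails.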

@valid_extensions

Jasmine


Re: Regex to match file extension in URL -- Bundled Extensions

by demerphq (Chancellor) on Sep 10, 2001 at 14:14 UTC

Jazz

This alternative uses File::Basename to extract the filename and the query string.

Hey! That's cheating! :-)
No, just kidding. Actually you are very right. Using File::Basename is much better than using a roll-your-own regex: you are much less likely to find that the rex doesn't work on some strange OS, and more likely to find that some of the weirder cases are properly handled. (For instance, a really robust regex would match BOTH /'s and \'s.) OTOH it _is_ a worthy educational process to learn how to do this. Tokenizing filenames with a regex is not a trivial exercise and IMHO therefore makes a good learning opportunity.
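
As a small illustration of the "BOTH / and \" point, here is a sketch that takes the last component of a path whichever separator it uses. The sub name last_component is made up for this example.

```perl
use strict;
use warnings;

# Take the last path component, treating / and \ alike.
sub last_component {
    my ($path) = @_;
    my ($name) = $path =~ m{([^/\\]*)\z};
    return $name;
}

print last_component('D:\Temp\file.ext.bak'), "\n";                # file.ext.bak
print last_component('http://www.foobar.com/foo/bar.html'), "\n";  # bar.html
```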

The non-trivial nature of tokenizing such a string is illustrated, incidentally, in the post by crazyinsomniac. Now this is a senior monk, with undoubtedly considerable experience, yet clearly he didn't examine too many cases with either his substr/index solution or his regex solution. When I run his solutions against my earlier posted test data I get some perverse results indeed. (The regex and substr versions don't even produce the same results.)

# selected results of crazyinsomniac's substr impl.
# double spacing converted to single by me.
http://perlmonks.com/index.pl?node_id=68135
        looks like the file name is: index.pl?node_id=68135
               and the extension is: pl?node_id=68135
   We even got a query string, whoa: node_id=68135
      so the true filename would be: index.pl?
and the true file extension would b: pl

http://www.foobar.com/foo/
        looks like the file name is: 
               and the extension is: com/foo/

http://www.foobar.com/foo?test
        looks like the file name is: foo?test
               and the extension is: com/foo?test
   We even got a query string, whoa: test
      so the true filename would be: foo?
and the true file extension would b: com/foo

http:///file.ext
        looks like the file name is: file.ext
               and the extension is: ext

#Selected results of CrazyInsomniacs regex implementation
#input string added by me
http://www.foobar.com
(, , )
(http, www.foobar.com/, foo/bar/foobar.html)
http://www.foobar.com/foo/bar/foo.bar.html
http:///file.ext
(, , )

wrong!

Note that this code will not handle multi-level extensions, such as .tar.gz

Ahh yes. Originally, as can be seen from the list I provided in my OP, I intended to post two solutions, one along the MS type lines and one along a more natural 'bundled' extension line. However I got a bit distracted by using CGI to output that table (yes it took me a while, Amoe, but that's ok, I was using it to learn basic CGI) and completely forgot to post the other solution. :-)

So in penance I offer two variants of the above regex. One will return all of the extensions bundled together, the other will return ONLY the last two or fewer extensions. This second variant could easily be modified for whatever level of bundling is required. I haven't included the full regex; these two snippets should fit in place over my earlier filename part and extension part, leaving the other parts untouched.

# regex snippet for matching
# at most two bundled extensions
# foobar.gzip      -> foobar,.gzip
# foobar.tar.gzip  -> foobar,.tar.gzip
# foo.bar.tar.gzip -> foo.bar,.tar.gzip
# the snippet should paste into place over 
# my earlier matches for filename and extension
                  (            #capture the filename
                      [^./?]   #  doesnt start with a . or ? or /
                      [^/?]+?  #  all chars not / or ? , (ctd.)
                               #    --leave stuff for rest of rex
                   )?          #we dont have to have a filename

                  (            #capture the extension
                     (?:       #  Group but dont capture
                        \.     #     they start with dots you know
                        [^?.]* #     any letter that arent a . or ?
                     ){0,2}    #  anywhere from 0 to 2 exts please.
                   )           #thanks..

# regex snippet for matching
# filename and all bundled extensions
# foobar.gzip      -> foobar,.gzip
# foobar.tar.gzip  -> foobar,.tar.gzip
# foo.bar.tar.gzip -> foo,.bar.tar.gzip
# the snippet should paste into place over 
# my earlier matches for filename and extension
                  (            #capture the filename
                      [^./?]   #  doesnt start with a . or ? or /
                      [^/?.]+? #  all chars not / or ? or. 
                               #    --leave stuff for rest of rex
                   )?          #we dont have to have a filename

                  (            #capture the extension
                      \.       #   they start with dots you know
                      [^?]*    #   any chars that aren't a ?
                   )?          #they are optional you know
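
# --- added sketch, not part of the original snippets ---
# One way to make "whatever level of bundling is required" a parameter is
# to build the file/extension part with qr// and interpolate the limit.
# This is only a sketch: make_splitter is a made-up name, and it is applied
# here to a bare file name rather than pasted into the full URL regex.

```perl
use strict;
use warnings;

# Build a file/extension matcher that bundles at most $max extensions.
sub make_splitter {
    my ($max) = @_;
    return qr{
        ^
        ( [^./?] [^/?]+? )            # file name, lazy
        ( (?: \. [^?.]* ){0,$max} )   # zero to max dot-separated extensions
        \z
    }x;
}

for my $max (1, 2) {
    my ($file, $ext) = 'foo.bar.tar.gzip' =~ make_splitter($max);
    print "$max: $file,$ext\n";
}
# 1: foo.bar.tar,.gzip
# 2: foo.bar,.tar.gzip
```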

Jazz

--
You are not ready to use symrefs unless you already know why they are bad. -- tadmc (CLPM)


Re: Regex to match file extension in URL
by jplindstrom (Monsignor) on Sep 09, 2001 at 16:23 UTC

$myvar =~ /\.html$/;

The $ will anchor the match to the end of the string.
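
A quick sketch of what the anchor buys you and where it stops helping (the sample URLs are made up for illustration):

```perl
use strict;
use warnings;

for my $url ('http://server.com/index.html',
             'http://server.com/index.html?page=2',
             'http://server.com/html/readme.txt') {
    my $verdict = $url =~ /\.html$/ ? 'matches' : 'does not match';
    print "$url $verdict\n";
}
```

Only the first URL matches: the second has a query string after the extension, and in the third the text "html" is not at the end of the string.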

If that wasn't what you wanted, could you provide examples of what you want as a result of the match?


Re: Re: Regex to match file extension in URL

by Amoe (Friar) on Sep 09, 2001 at 16:26 UTC

Sorry if that wasn't clear. What I need is a regex to match the extension of the remote file, whatever it is, not just if it's .html. The $myvar was just an example.


Re: Re: Re: Regex to match file extension in URL

by Trimbach (Curate) on Sep 09, 2001 at 16:44 UTC

if ($myvar =~ m/\.([^.]+)$/) {
print "Matched $1";
}
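
One caveat, shown as a sketch rather than as part of the reply above: this pattern takes whatever follows the last dot in the whole string, so on a full URL it can pick up a query string or the tail of the host name. The helper extension_of below is a hypothetical, slightly more careful variant that drops the query string and only looks at the part after the host.

```perl
use strict;
use warnings;

# Hypothetical helper: extension of the file part of a URL, or undef.
sub extension_of {
    my ($url) = @_;
    (my $bare = $url) =~ s/[?#].*\z//s;      # drop query string and fragment
    return unless $bare =~ m{^\w+://[^/]+/(?:.*/)?[^/]*\.([^./]+)\z}s;
    return $1;
}

for my $url ('http://server.com/subdir/index.html',
             'http://server.com/whatever.cgi?testing=1',
             'http://www.foobar.com') {
    my ($naive) = $url =~ m/\.([^.]+)$/;
    my $careful = extension_of($url);
    printf "%s\n  last-dot pattern: %s\n  path-aware:       %s\n",
        $url,
        defined $naive   ? $naive   : '(none)',
        defined $careful ? $careful : '(none)';
}
```

For the second URL the last-dot pattern returns "cgi?testing=1", and for the third it returns "com", while the path-aware version gives "cgi" and nothing respectively.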

Gary Blackburn
Trained Killer


Re: Re: Re: Re: Regex to match file extension in URL

by Amoe (Friar) on Sep 09, 2001 at 17:15 UTC

Re: Re: Re: Re: Re: Regex to match file extension in URL

by Trimbach (Curate) on Sep 09, 2001 at 19:07 UTC

(crazyinsomniac) Re: Regex to match file extension in URL
by crazyinsomniac (Prior) on Sep 10, 2001 at 09:10 UTC

It's a job for good old fashioned substr and rindex

#!/usr/bin/perl -w
use strict;
my $URK = 'http:/://blah.foo.comwhatever/dir/file.extensionelarocko?query=blah&fckj=ekjl';
my $last_slash = rindex $URK, '/';
my $last_dot = rindex $URK, '.';
my $query = rindex $URK, '?';

print "$URK\n\n";
if( ($last_slash >= 0) and ($last_dot >= 0) )
{
    printf "%35.35s: %s\n\n",
           "looks like the file name is",
           substr( $URK, $last_slash + 1);
    printf "%35.35s: %s\n\n",
           "and the extension is",
            substr($URK,$last_dot + 1);
}
else
{
    print "seems like we got index.something on our hands\n\n";
}

if($query >= 0)
{
    printf "%35.35s: %s\n\n",
           "We even got a query string, whoa",
           substr($URK, $query + 1);

    printf "%35.35s: %s\n\n",
    "so the true filename would be",
    substr ( $URK , $last_slash + 1, $query - $last_slash );

    printf "%35.35s: %s\n\n",
    "and the true file extension would be",
    substr ( $URK , $last_dot + 1, $query - $last_dot - 1 );
}

__END__

=head1 RESULTS

http:/://blah.foo.comwhatever/dir/file.extensionelarocko?query=blah&fckj=ekjl

        looks like the file name is: file.extensionelarocko?query=blah&fckj=ekjl


               and the extension is: extensionelarocko?query=blah&fckj=ekjl

   We even got a query string, whoa: query=blah&fckj=ekjl

      so the true filename would be: file.extensionelarocko?

and the true file extension would b: extensionelarocko

=cut
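
For comparison, here is a sketch of the same substr/index/rindex idea with the query string cut off first and the dot searched for only inside the file name, which sidesteps cases such as a dot in the host name or a trailing '?'. The sub name file_and_ext is invented for this example; it is a variation on the script above, not a replacement for a real URL parser.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (file name, extension) for a URL using
# only index/rindex/substr. Either value may be ''.
sub file_and_ext {
    my ($url) = @_;

    # Everything from the first ? on is the query string; set it aside.
    my $query = index $url, '?';
    my $bare  = $query >= 0 ? substr($url, 0, $query) : $url;

    # The last slash only counts if it comes after the '://'.
    my $scheme = index $bare, '://';
    my $start  = $scheme >= 0 ? $scheme + 3 : 0;
    my $slash  = rindex $bare, '/';
    return ('', '') if $slash < $start;      # no path at all, only a host

    # Look for the dot inside the file name, not in the whole URL.
    my $file = substr $bare, $slash + 1;
    my $dot  = rindex $file, '.';
    my $ext  = $dot >= 0 ? substr($file, $dot + 1) : '';
    return ($file, $ext);
}

for my $url ('http://perlmonks.com/index.pl?node_id=68135',
             'http://www.foobar.com/foo/',
             'http://www.foobar.com/foo?test',
             'http://www.foobar.com') {
    my ($file, $ext) = file_and_ext($url);
    print "$url\n  file: '$file'  ext: '$ext'\n";
}
```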

Also, CGI.pm has regexes that will give you all kinds of good stuff from the query string/request url .... you can either use CGI.pm or steal the code from the module depending on your needs(script_name() path_translated() path_info()).

## AND IF YOU WANNA CHECK FOR A VALID URL, YOU REALLY NEED A MODULE (URI::URL)
## BUT substr and rindex are still the best for the job
my $url = 'proto://domain.something/dir/file.extension';
my $protocol = substr $url, 0, index($url, '://'),'';
## yada yada yada, you get the point

However, a regular expression might be "easier" to digest, something along the lines (like "others" have already shown)

my $url = 'ptoto://foo.combarz.erk/file.ext?query';
my ($proto, $domain, $filedirquery) =
$url =~
m|(\w{2,6})://([.a-zA-Z0-9-]+/)(.*?)$|;

print "($proto, $domain, $filedirquery)\n";

update:

demerphq

print substr 'file.htm', 1 + rindex 'file.htm', '.';

Anyway, lots of good reading in this thread.

___crazyinsomniac_______________________________________
Disclaimer: Don't blame. It came from inside the void
perl -e "$q=$_;map({chr unpack qq;H*;,$_}split(q;;,q*H*));print;$q/$q;"


Re: Regex to match file extension in URL
by Amoe (Friar) on Sep 10, 2001 at 00:52 UTC

demerphq
