Amoe has asked for the wisdom of the Perl Monks concerning the following question:

Quick question here: I need a regex to extract the file selector thing from a uri. So if the uri was: http://www.myurl.com/myurl/foo/bar/zip100.html I'd need to get zip100.html. I've used all my regex skills (which don't extend to much) and have failed to come up with an answer, so could you guys enlighten me?

Replies are listed 'Best First'.
Re: Regexes and URIs
by japhy (Canon) on Mar 29, 2001 at 21:49 UTC
    I would suggest using the URI module to do this safely. Otherwise, you run the risk of not matching it correctly. I don't think this method will work for all cases:
    ($file) = $URI =~ m{
        ^ (?: https? | ftp ) ://   # scheme
        [^/]+                      # domain
        (?: / [^/?#]* )*           # directories
        / ( [^/?#]* )              # filename
        (?: $ | [?#] )
    }x;
    I do not advocate using the regex I just made. I didn't even test it. I doubt it works reliably.
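For comparison, here is a minimal sketch of the URI-module route, assuming the URI distribution from CPAN is installed. `path_segments` is a documented URI method; the variable names, and the query string and fragment tacked onto the original URL, are made up for illustration.

```perl
use strict;
use warnings;
use URI;

# The poster's URL plus a made-up query string and fragment.
my $uri = URI->new('http://www.myurl.com/myurl/foo/bar/zip100.html?page=2#top');

# path_segments splits the path on "/"; the last element is the file part.
# The query string and fragment are not part of the path, so they drop out.
my $file = ($uri->path_segments)[-1];

print "$file\n";
```

If the path ends in a slash, the last segment should come back as an empty string, so the caller still has to decide what to do about default index files.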

    japhy -- Perl and Regex Hacker
      Indeed. If you make the mistake of using that RE you will have broken code. It will look right to you. It will work in your tests. But if someone like me comes along who knows how to put names and passwords in URLs, it will break and I won't be happy.

      Put names and passwords in URLs? Most people don't know that you can do that. But try it:

      http://name:password@www.company.com/whatever/to/get.html
      Substitute in a name and password you use. Substitute in a protocol like ftp if that is easier. Give it a shot from your browser, LWP::Simple, etc.

      This pattern is in the spec. It will work with any tool that I have ever tried. It will work with every protocol. If it does not work with your tool, then that is a bug.
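As a rough sketch of what the module does with such a URL (assuming the URI distribution is installed; the variable names are only illustrative), the name and password come back as their own component and stay out of the host and the path:

```perl
use strict;
use warnings;
use URI;

# Single quotes matter here: in a double-quoted string Perl would try to
# interpolate @www as an array.
my $uri = URI->new('http://name:password@www.company.com/whatever/to/get.html');

my $userinfo = $uri->userinfo;             # the "name:password" part
my $host     = $uri->host;                 # host name without the userinfo
my $file     = ($uri->path_segments)[-1];  # last path segment

print "userinfo: $userinfo\n";
print "host:     $host\n";
print "file:     $file\n";
```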

      This is why japhy would have used the standard library. He doesn't know the spec off of the top of his head. He knows he doesn't. And rather than finding it and having to figure out how to do the whole thing correctly, he can just use an existing library and be confident that it will Just Work. By contrast his off-the-cuff solution will work for 99% of the domain space, but (exactly as he predicted) will break somewhere...

      The goal is to be right with as little work as possible. So use the module.

Re: Regexes and URIs
by mr.nick (Chaplain) on Mar 29, 2001 at 23:28 UTC
    You want File::Basename.
    use File::Basename;
    use URI::Escape;

    my $url  = 'http://www.myurl.com/myurl/foo/bar/zip100.html';
    my $base = basename uri_unescape $url;

    note: I added the URI::Escape to make sure the original URL was properly decoded before passing it off to basename.
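One thing to watch with the basename approach: a query string or fragment stays glued to the result. A minimal sketch that strips them first, using only the core File::Basename module. The helper name `url_basename` and the `?page=2#top` suffix are made up for this example, and the unescaping step is left out to keep it core-only:

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical helper: drop the query string and fragment, then let
# basename pick the last path component.
sub url_basename {
    my ($url) = @_;
    (my $path = $url) =~ s/[?#].*\z//s;   # cut at the first "?" or "#"
    return basename($path);
}

my $base = url_basename('http://www.myurl.com/myurl/foo/bar/zip100.html?page=2#top');
print "$base\n";
```

Note that basename ignores a trailing slash and returns the last directory name, so a URL that ends in `/` will not give you an empty string here. It also assumes a platform where File::Basename treats `/` as the path separator.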
(jeffa) Re: Regexes and URIs
by jeffa (Bishop) on Mar 29, 2001 at 21:51 UTC
    For your example this would work:
    /\/(\w+\.\w+)$/
    Things get trickier if you have parameters after the file, or even anchors - you could get lazy and use Dot Star:
    /\/(\w+\.\w+)((\?|#|\/).*)?$/;
    But I am sure there is a better way; japhy's reference to URI is probably the best. Explore!
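A quick sketch of where the Dot Star version can go wrong. The second URL (example.com) is a made-up test case, not from the original question. Because the optional tail also accepts a `/`, the leftmost match can land on a two-label host name instead of the file:

```perl
use strict;
use warnings;

# The second pattern from above, stored for reuse.
my $lazy = qr/\/(\w+\.\w+)((\?|#|\/).*)?$/;

# The original URL with a made-up query string: $1 is the file name.
my ($ok) = 'http://www.myurl.com/myurl/foo/bar/zip100.html?page=2' =~ $lazy;

# Hypothetical URL with a two-label host: the match starts at the host,
# and "/foo/bar.html" is swallowed by the optional tail.
my ($oops) = 'http://example.com/foo/bar.html' =~ $lazy;

print "ok:   $ok\n";
print "oops: $oops\n";
```

Here `$ok` holds `zip100.html`, while `$oops` ends up as `example.com`, which is one more argument for the module.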

    Jeff

    R-R-R--R-R-R--R-R-R--R-R-R--R-R-R--
    L-L--L-L--L-L--L-L--L-L--L-L--L-L--
    
Re: Regexes and URIs
by alfie (Pilgrim) on Mar 29, 2001 at 21:56 UTC
    Do you have the complete URI in a single string, or is it embedded in a longer text? The first case is quite simple:
    ($file) = $uri =~ /^.*\/(.*)$/;
    You might end up with an empty $file if the URI ends in a / (and therefore uses the default index file).
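A small sketch of that behaviour, with the regex wrapped in a hypothetical helper called `last_segment`. The second URL is the original one cut off at the directory:

```perl
use strict;
use warnings;

# Hypothetical wrapper around the regex above: everything after the last "/".
sub last_segment {
    my ($uri) = @_;
    my ($file) = $uri =~ /^.*\/(.*)$/;
    return $file;   # undef if the string contains no "/" at all
}

my $with_file = last_segment('http://www.myurl.com/myurl/foo/bar/zip100.html');
my $dir_only  = last_segment('http://www.myurl.com/myurl/foo/bar/');

print "with file: '$with_file'\n";
print "dir only:  '$dir_only'\n";
```

The first call gives `zip100.html` and the second an empty string, so the caller has to pick a default index name itself. A query string or fragment would still be part of the result.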

    The latter is more complicated, because you have to keep several things in mind:

    • You should change the class of allowed characters to something more strict, to just the allowed characters that are defined in the URI RFC.
    • On the other hand, people tend to put characters that are legal in a URI directly after their URIs. Common characters in that range include the dot (`.'), closing brackets (`)') and comma (`,'). Quotation marks fall in that range as well. So you should also end the pattern with a one-character class that excludes those.
    There might be other things that I haven't thought of yet, but I hope you get the picture...
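The two points above could be sketched roughly like this. Both character classes are simplified approximations invented for this example, not the exact sets from the URI RFC, and the sample text (including the ftp.example.org address) is made up:

```perl
use strict;
use warnings;

# Assumed, simplified classes: characters allowed inside a URI, and a
# stricter set for the final character (no dot, comma, bracket or quote).
my $uri_chars  = q{A-Za-z0-9._~:/?#@!$&'()*+,;=%-};
my $last_chars = q{A-Za-z0-9/_~=-};

# Scheme, a greedy run of allowed characters, then one "safe" last character.
# Backtracking gives up trailing punctuation until the last class matches.
my $uri_re = qr{\b((?:https?|ftp)://[$uri_chars]*[$last_chars])};

my $text = 'See http://www.myurl.com/myurl/foo/bar/zip100.html, '
         . 'or (ftp://ftp.example.org/pub/file.tar.gz).';

# In list context with /g the match returns every captured URI.
my @found = $text =~ /$uri_re/g;

print "$_\n" for @found;
```

With this sample text the comma after the first URI and the `).` after the second are left out of the matches. A URI that really does end in one of the excluded characters would be cut short, so this is a trade-off rather than a complete answer.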
    --
    Alfie