baurel has asked for the wisdom of the Perl Monks concerning the following question:

hi
I'm new to Perl and looking for solutions to access some parts of an URL. I've googled a lot and searched the forum of Perl Monks but couldn't find a solution. My problem is the following:

For example two urls:
- http://foo.com/downloads/pkg-5.6.tar.gz
- http://bar.com/list/file-5.6-win32.zip


I'm interested in the following parts:

- just the filename like:
-- pkg-5.6.tar.gz
-- file-5.6-win32.zip
and
- just the filename without the extension(zip, tar.gz) like:
-- pkg-5.6
-- file-5.6-win32

How can this be done with out the help of any modules like use File::Basename?

Sorry that I don't come up with a proposition - I'm still working on it.
thanks
Ben

Replies are listed 'Best First'.
Re: Splitting an url in its components
by moritz (Cardinal) on Jul 17, 2008 at 11:45 UTC
    Usually it's best to use a module like Regexp::Common::URI or URI or Rose::URI to extract the path information.

    If you are only interested in anything after the last slash, a regex should work fine:

    if ($uri =~ m{([^/]+)\z}{ print $1, $/; }

    Stripping the extension is a bit harder, because first you have to define what an extension is. If you just want to split off everything from the last period to the end, use a regex like this:

    $filename =~ s/\.[^.]+\z//;

    That will give you pkg-5.6.tar in the first example, which is technically correct, because you have a .tar file inside a .gz file. If you don't like that outcome, specify how the recognition of the extension should work.

      hi
      thanks alot!

      Could you tell me the solution

      filename input: pkg-5.6.tar.gz

      to

      filename output: pkg-5.6

      Regular expressions are really hard for me ... I manage to read your solutions (more or less) but I'm not able to write my own yet :-(
        Well, you can special-case it, but then it'll only work for .tar.gz:
        s{\.(?:tar\.gz|[^.]+)\z}{};

        The problem is that, in general, you can't know from a file name which part of it is "extension" and which part is not, unless you either have a clear-cut definition of what "extension" means (I don't know any that satisfies your requirement), or you have a list of all possible extensions.

        Both my solutions (below) treat every dot-followed-by-things-that-are-word-chars-but-arent-just-numbers as extensions, so they'll get both your cases pkg-5.6.tar.gz and pkg-5.6-win32.zip right
        []s, HTH, Massa
Re: Splitting an url in its components
by Your Mother (Archbishop) on Jul 17, 2008 at 17:46 UTC

    What moritz said initially. URI is core and it's really simple and flexible once you get the hang of it. Solving problems with regexes in defined formats (URI, HTML, XML, whatever) is almost always a mistake.

    use URI; my @uri = qw( http://foo.com/downloads/pkg-5.6.tar.gz http://bar.com/list/file-5.6-win32.zip ); for my $raw ( @uri ) { my $uri = URI->new($raw); printf("%s\n - %s\n - %s\n - %s\n\n", $uri, $uri->path, ( $uri->path_segments )[-1], $uri->host, ); } -- http://foo.com/downloads/pkg-5.6.tar.gz - /downloads/pkg-5.6.tar.gz - pkg-5.6.tar.gz - foo.com http://bar.com/list/file-5.6-win32.zip - /list/file-5.6-win32.zip - file-5.6-win32.zip - bar.com
Re: Splitting an url in its components
by scorpio17 (Canon) on Jul 17, 2008 at 13:23 UTC

    Here's a solution using "split":

    use strict; my @urls = qw( http://foo.com/downloads/pkg-5.6.tar.gz http://bar.com/list/file-5.6-win32.zip ); for my $url (@urls) { my @stuff = split(/\//, $url); # split on '/' my $file = pop @stuff; # take item off the 'right' side print "file = $file\n"; my @stuff2 = split(/\./, $file); # split on '.' my $file2 = shift @stuff2; # take item of 'left' side print "file2 = $file2\n"; }
Re: Splitting an url in its components
by leocharre (Priest) on Jul 17, 2008 at 14:02 UTC
    This is an imperfect example of doing this via regexes.
    #!/usr/bin/perl use strict; my @urls = qw(http://foo.com/downloads/pkg-5.6.tar.gz http://bar.com/list/file-5.6-win32.zip example.org/subd/golf.txt http://digg.com/page4 http://www.cnn.com/2008/US/07/17/beck.che.guevara/index.html ); for my $url (@urls){ print STDERR "URL: $url\n"; my $segment = url_segments($url); for my $key ( sort keys %$segment ){ my $val = $segment->{$key}; print STDERR " - $key : $val\n"; } print STDERR "\n"; } exit; sub url_segments { my $url = shift; my %segment = ( url => $url, domain => undef, location => undef, location_relative => undef, filename => undef, filename_only => undef, extension => undef, query_string => undef, protocol => undef, ); if ( $url=~s/(\?.*)$// ) {# take out possible query string $segment{query_string} = $1; } if ( $url=~s/^([a-z]{3,5})\:\/\///i ){ $segment{protocol} = $1; } if ( $url=~s/^([\w\.]+)// ){ $segment{domain} = $1; } if ( $url=~s/([^\/]+)\.([a-z0-9]{1,5})$// ){ $segment{filename} = "$1.$2"; $segment{filename_only} = $1; $segment{extension} = $2; } if ( $url ){ $segment{location_relative} = $url; $segment{location} = $segment{domain}.$segment{location_relative +}; } return \%segment; }
    # output:
    URL: http://foo.com/downloads/pkg-5.6.tar.gz
     - domain : foo.com
     - extension : gz
     - filename : pkg-5.6.tar.gz
     - filename_only : pkg-5.6.tar
     - location : foo.com/downloads/
     - location_relative : /downloads/
     - protocol : http
     - query_string :
     - url : http://foo.com/downloads/pkg-5.6.tar.gz
    
    URL: http://bar.com/list/file-5.6-win32.zip
     - domain : bar.com
     - extension : zip
     - filename : file-5.6-win32.zip
     - filename_only : file-5.6-win32
     - location : bar.com/list/
     - location_relative : /list/
     - protocol : http
     - query_string :
     - url : http://bar.com/list/file-5.6-win32.zip
    
    URL: example.org/subd/golf.txt
     - domain : example.org
     - extension : txt
     - filename : golf.txt
     - filename_only : golf
     - location : example.org/subd/
     - location_relative : /subd/
     - protocol :
     - query_string :
     - url : example.org/subd/golf.txt
    
    URL: http://digg.com/page4
     - domain : digg.com
     - extension :
     - filename :
     - filename_only :
     - location : digg.com/page4
     - location_relative : /page4
     - protocol : http
     - query_string :
     - url : http://digg.com/page4
    
    URL: http://www.cnn.com/2008/US/07/17/beck.che.guevara/index.html
     - domain : www.cnn.com
     - extension : html
     - filename : index.html
     - filename_only : index
     - location : www.cnn.com/2008/US/07/17/beck.che.guevara/
     - location_relative : /2008/US/07/17/beck.che.guevara/
     - protocol : http
     - query_string :
     - url : http://www.cnn.com/2008/US/07/17/beck.che.guevara/index.html
    
Re: Splitting an url in its components
by massa (Hermit) on Jul 17, 2008 at 15:47 UTC
    If you _really_ want to reinvent the wheel, you can do with a single regex like:
    use strict; use warnings; sub url_segments { local($_) = @_; my $segments = { url => $_ }; return unless m{ \A (?: ([a-z]{3,6}) : (?://)? )? # proto ( [:a-z0-9@._-]+ ) # domain (?: ( .*? / ) # location ( ( [^/]*? ) # filename_only ( (?: \. \w*[a-z]\w* )* ) # ext ) (?: ( \? .* ) )? # query )? \Z }x; $segments->{proto} = $1 if $1; $segments->{domain} = $2 if $2; $segments->{location} = $3 if $3; $segments->{filename} = $4 if $4; $segments->{filename_only} = $5 if $5; $segments->{ext} = $6 if $6; $segments->{query} = $7 if $7; $segments } use YAML; print Dump( { map { $_ => url_segments $_ } qw( http://foo.com/downloads/pkg-5.6.tar.gz http://bar.com/list/file-5.6-win32.zip example.org/subd/golf.txt http://digg.com/page4 http://digg.com/page4?format=printable http://www.cnn.com/2008/US/07/17/beck.che.guevara/index.html https://tomwtriker:imzadi@coldmail.net/goethe.3.5.tar mailto:tomwtriker@coldmail.net ) } )
    Result:
    --- example.org/subd/golf.txt: domain: example.org ext: .txt filename: golf.txt filename_only: golf location: /subd/ url: example.org/subd/golf.txt http://bar.com/list/file-5.6-win32.zip: domain: bar.com ext: .zip filename: file-5.6-win32.zip filename_only: file-5.6-win32 location: /list/ proto: http url: http://bar.com/list/file-5.6-win32.zip http://digg.com/page4: domain: digg.com filename: page4 filename_only: page4 location: / proto: http url: http://digg.com/page4 http://digg.com/page4?format=printable: domain: digg.com filename: page4 filename_only: page4 location: / proto: http query: '?format=printable' url: http://digg.com/page4?format=printable http://foo.com/downloads/pkg-5.6.tar.gz: domain: foo.com ext: .tar.gz filename: pkg-5.6.tar.gz filename_only: pkg-5.6 location: /downloads/ proto: http url: http://foo.com/downloads/pkg-5.6.tar.gz http://www.cnn.com/2008/US/07/17/beck.che.guevara/index.html: domain: www.cnn.com ext: .html filename: index.html filename_only: index location: /2008/US/07/17/beck.che.guevara/ proto: http url: http://www.cnn.com/2008/US/07/17/beck.che.guevara/index.html https://tomwtriker:imzadi@coldmail.net/goethe.3.5.tar: domain: tomwtriker:imzadi@coldmail.net ext: .tar filename: goethe.3.5.tar filename_only: goethe.3.5 location: / proto: https url: https://tomwtriker:imzadi@coldmail.net/goethe.3.5.tar mailto:tomwtriker@coldmail.net: domain: tomwtriker@coldmail.net proto: mailto url: mailto:tomwtriker@coldmail.net
    []s, HTH, Massa
      (replying to self, duh)
      Here is a more complete regex:
      sub url_segments { local($_) = @_; my $segments = { url => $_ }; my @matches = m{ \A (?: ([a-z]{3,6}) # proto : (?: // )? )? (?: (?: ([a-z0-9_-]+) # user (?: : ([^@]+) ) @ # password )? ( [a-z0-9._-]+ ) # domain (?: : (\d+) )? # port ) (?: ( .*? / ) # location ( ( [^/]*? ) # filename_only ( ( \. \w*[a-z]\w* )* ) # ext ) (?: \? ( .* ) )? # query )? \Z }xi; return unless @matches; my @parts = qw(proto user pass domain port location filename filename_only ext query); $matches[$_] and $segments->{$parts[$_]} = $matches[$_] for 0 .. $#parts; $segments }
      []s, HTH, Massa
      Wow, that's pretty freaking elegant.
      I like this example more than the one after, because it's legible.
      Nice YAML touch- Thank you for showing us!
        At your service... :-)
        I like the first example better, too, but the second one gets more parts of the URI separated... and that's why I had to break it in so many lines! As for YAML, it's my favourite debugging tool (I think it's far more legible than Data::Dump(er)?)
        []s, HTH, Massa