Splitting an url in its components

baurel has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Splitting an url in its components by moritz (Cardinal) on Jul 17, 2008 at 11:45 UTC
Usually it's best to use a module like Regexp::Common::URI or URI or Rose::URI to extract the path information. If you are only interested in anything after the last slash, a regex should work fine: `if ($uri =~ m{([^/]+)\z}{ print $1, $/; }` [download] Stripping the extension is a bit harder, because first you have to define what an extension is. If you just want to split off everything from the last period to the end, use a regex like this: `$filename =~ s/\.[^.]+\z//;` [download] That will give you pkg-5.6.tar in the first example, which is technically correct, because you have a .tar file inside a .gz file. If you don't like that outcome, specify how the recognition of the extension should work.	[reply] [d/l] [select]
Re^2: Splitting an url in its components by baurel (Sexton) on Jul 17, 2008 at 13:34 UTC
hi thanks alot! Could you tell me the solution filename input: pkg-5.6.tar.gz to filename output: pkg-5.6 Regular expressions are really hard for me ... I manage to read your solutions (more or less) but I'm not able to write my own yet :-(	[reply]
Re^3: Splitting an url in its components by moritz (Cardinal) on Jul 17, 2008 at 15:15 UTC
Well, you can special-case it, but then it'll only work for .tar.gz: `s{\.(?:tar\.gz\|[^.]+)\z}{};` [download] The problem is that, in general, you can't know from a file name which part of it is "extension" and which part is not, unless you either have a clear-cut definition of what "extension" means (I don't know any that satisfies your requirement), or you have a list of all possible extensions.	[reply] [d/l]
Re^4: Splitting an url in its components by baurel (Sexton) on Jul 17, 2008 at 15:42 UTC
Re^5: Splitting an url in its components by moritz (Cardinal) on Jul 17, 2008 at 15:48 UTC
Re^5: Splitting an url in its components by Anonymous Monk on Jul 18, 2008 at 04:44 UTC
Re^3: Splitting an url in its components by massa (Hermit) on Jul 17, 2008 at 22:08 UTC
Both my solutions (below) treat every dot-followed-by-things-that-are-word-chars-but-arent-just-numbers as extensions, so they'll get both your cases pkg-5.6.tar.gz and pkg-5.6-win32.zip right []s, HTH, Massa	[reply]
Re: Splitting an url in its components by Your Mother (Archbishop) on Jul 17, 2008 at 17:46 UTC
What moritz said initially. URI is core and it's really simple and flexible once you get the hang of it. Solving problems with regexes in defined formats (URI, HTML, XML, whatever) is almost always a mistake. `use URI; my @uri = qw( http://foo.com/downloads/pkg-5.6.tar.gz http://bar.com/list/file-5.6-win32.zip ); for my $raw ( @uri ) { my $uri = URI->new($raw); printf("%s\n - %s\n - %s\n - %s\n\n", $uri, $uri->path, ( $uri->path_segments )[-1], $uri->host, ); } -- http://foo.com/downloads/pkg-5.6.tar.gz - /downloads/pkg-5.6.tar.gz - pkg-5.6.tar.gz - foo.com http://bar.com/list/file-5.6-win32.zip - /list/file-5.6-win32.zip - file-5.6-win32.zip - bar.com` [download]	[reply] [d/l]
Re: Splitting an url in its components by scorpio17 (Canon) on Jul 17, 2008 at 13:23 UTC
Here's a solution using "split": `use strict; my @urls = qw( http://foo.com/downloads/pkg-5.6.tar.gz http://bar.com/list/file-5.6-win32.zip ); for my $url (@urls) { my @stuff = split(/\//, $url); # split on '/' my $file = pop @stuff; # take item off the 'right' side print "file = $file\n"; my @stuff2 = split(/\./, $file); # split on '.' my $file2 = shift @stuff2; # take item of 'left' side print "file2 = $file2\n"; }` [download]	[reply] [d/l]
Re: Splitting an url in its components by leocharre (Priest) on Jul 17, 2008 at 14:02 UTC
This is an imperfect example of doing this via regexes. #!/usr/bin/perl use strict; my @urls = qw(http://foo.com/downloads/pkg-5.6.tar.gz http://bar.com/list/file-5.6-win32.zip example.org/subd/golf.txt http://digg.com/page4 http://www.cnn.com/2008/US/07/17/beck.che.guevara/index.html ); for my $url (@urls){ print STDERR "URL: $url\n"; my $segment = url_segments($url); for my $key ( sort keys %$segment ){ my $val = $segment->{$key}; print STDERR " - $key : $val\n"; } print STDERR "\n"; } exit; sub url_segments { my $url = shift; my %segment = ( url => $url, domain => undef, location => undef, location_relative => undef, filename => undef, filename_only => undef, extension => undef, query_string => undef, protocol => undef, ); if ( $url=~s/(\?.*)$// ) {# take out possible query string $segment{query_string} = $1; } if ( $url=~s/^([a-z]{3,5})\:\/\///i ){ $segment{protocol} = $1; } if ( $url=~s/^([\w\.]+)// ){ $segment{domain} = $1; } if ( $url=~s/([^\/]+)\.([a-z0-9]{1,5})$// ){ $segment{filename} = "$1.$2"; $segment{filename_only} = $1; $segment{extension} = $2; } if ( $url ){ $segment{location_relative} = $url; $segment{location} = $segment{domain}.$segment{location_relative +}; } return \%segment; } [download] # output: URL: http://foo.com/downloads/pkg-5.6.tar.gz - domain : foo.com - extension : gz - filename : pkg-5.6.tar.gz - filename_only : pkg-5.6.tar - location : foo.com/downloads/ - location_relative : /downloads/ - protocol : http - query_string : - url : http://foo.com/downloads/pkg-5.6.tar.gz URL: http://bar.com/list/file-5.6-win32.zip - domain : bar.com - extension : zip - filename : file-5.6-win32.zip - filename_only : file-5.6-win32 - location : bar.com/list/ - location_relative : /list/ - protocol : http - query_string : - url : http://bar.com/list/file-5.6-win32.zip URL: example.org/subd/golf.txt - domain : example.org - extension : txt - filename : golf.txt - filename_only : golf - location : example.org/subd/ - location_relative : /subd/ - protocol : - query_string : - url : example.org/subd/golf.txt URL: http://digg.com/page4 - domain : digg.com - extension : - filename : - filename_only : - location : digg.com/page4 - location_relative : /page4 - protocol : http - query_string : - url : http://digg.com/page4 URL: http://www.cnn.com/2008/US/07/17/beck.che.guevara/index.html - domain : www.cnn.com - extension : html - filename : index.html - filename_only : index - location : www.cnn.com/2008/US/07/17/beck.che.guevara/ - location_relative : /2008/US/07/17/beck.che.guevara/ - protocol : http - query_string : - url : http://www.cnn.com/2008/US/07/17/beck.che.guevara/index.html	[reply] [d/l]
Re: Splitting an url in its components by massa (Hermit) on Jul 17, 2008 at 15:47 UTC
If you _really_ want to reinvent the wheel, you can do with a single regex like: use strict; use warnings; sub url_segments { local($_) = @_; my $segments = { url => $_ }; return unless m{ \A (?: ([a-z]{3,6}) : (?://)? )? # proto ( [:a-z0-9@._-]+ ) # domain (?: ( .? / ) # location ( ( [^/]? ) # filename_only ( (?: \. \w[a-z]\w )* ) # ext ) (?: ( \? .* ) )? # query )? \Z }x; $segments->{proto} = $1 if $1; $segments->{domain} = $2 if $2; $segments->{location} = $3 if $3; $segments->{filename} = $4 if $4; $segments->{filename_only} = $5 if $5; $segments->{ext} = $6 if $6; $segments->{query} = $7 if $7; $segments } use YAML; print Dump( { map { $_ => url_segments $_ } qw( http://foo.com/downloads/pkg-5.6.tar.gz http://bar.com/list/file-5.6-win32.zip example.org/subd/golf.txt http://digg.com/page4 http://digg.com/page4?format=printable http://www.cnn.com/2008/US/07/17/beck.che.guevara/index.html https://tomwtriker:imzadi@coldmail.net/goethe.3.5.tar mailto:tomwtriker@coldmail.net ) } ) [download] Result: --- example.org/subd/golf.txt: domain: example.org ext: .txt filename: golf.txt filename_only: golf location: /subd/ url: example.org/subd/golf.txt http://bar.com/list/file-5.6-win32.zip: domain: bar.com ext: .zip filename: file-5.6-win32.zip filename_only: file-5.6-win32 location: /list/ proto: http url: http://bar.com/list/file-5.6-win32.zip http://digg.com/page4: domain: digg.com filename: page4 filename_only: page4 location: / proto: http url: http://digg.com/page4 http://digg.com/page4?format=printable: domain: digg.com filename: page4 filename_only: page4 location: / proto: http query: '?format=printable' url: http://digg.com/page4?format=printable http://foo.com/downloads/pkg-5.6.tar.gz: domain: foo.com ext: .tar.gz filename: pkg-5.6.tar.gz filename_only: pkg-5.6 location: /downloads/ proto: http url: http://foo.com/downloads/pkg-5.6.tar.gz http://www.cnn.com/2008/US/07/17/beck.che.guevara/index.html: domain: www.cnn.com ext: .html filename: index.html filename_only: index location: /2008/US/07/17/beck.che.guevara/ proto: http url: http://www.cnn.com/2008/US/07/17/beck.che.guevara/index.html https://tomwtriker:imzadi@coldmail.net/goethe.3.5.tar: domain: tomwtriker:imzadi@coldmail.net ext: .tar filename: goethe.3.5.tar filename_only: goethe.3.5 location: / proto: https url: https://tomwtriker:imzadi@coldmail.net/goethe.3.5.tar mailto:tomwtriker@coldmail.net: domain: tomwtriker@coldmail.net proto: mailto url: mailto:tomwtriker@coldmail.net [download] []s, HTH, Massa	[reply] [d/l] [select]
Re^2: Splitting an url in its components by massa (Hermit) on Jul 17, 2008 at 16:35 UTC
(replying to self, duh) Here is a more complete regex: sub url_segments { local($_) = @_; my $segments = { url => $_ }; my @matches = m{ \A (?: ([a-z]{3,6}) # proto : (?: // )? )? (?: (?: ([a-z0-9_-]+) # user (?: : ([^@]+) ) @ # password )? ( [a-z0-9._-]+ ) # domain (?: : (\d+) )? # port ) (?: ( .? / ) # location ( ( [^/]? ) # filename_only ( ( \. \w[a-z]\w )* ) # ext ) (?: \? ( .* ) )? # query )? \Z }xi; return unless @matches; my @parts = qw(proto user pass domain port location filename filename_only ext query); $matches[$_] and $segments->{$parts[$_]} = $matches[$_] for 0 .. $#parts; $segments } [download] []s, HTH, Massa	[reply] [d/l]
Re^2: Splitting an url in its components by leocharre (Priest) on Jul 17, 2008 at 17:42 UTC
Wow, that's pretty freaking elegant. I like this example more than the one after, because it's legible. Nice YAML touch- Thank you for showing us!	[reply]
Re^3: Splitting an url in its components by massa (Hermit) on Jul 17, 2008 at 22:14 UTC
At your service... :-) I like the first example better, too, but the second one gets more parts of the URI separated... and that's why I had to break it in so many lines! As for YAML, it's my favourite debugging tool (I think it's far more legible than Data::Dump(er)?) []s, HTH, Massa	[reply]
Re^4: Splitting an url in its components by Anonymous Monk on Jul 18, 2008 at 04:44 UTC
Re^5: Splitting an url in its components by massa (Hermit) on Jul 18, 2008 at 12:44 UTC