Re: Splitting an url in its components
by moritz (Cardinal) on Jul 17, 2008 at 11:45 UTC
|
Usually it's best to use a module like Regexp::Common::URI or URI or Rose::URI to extract the path information.
If you are only interested in anything after the last slash, a regex should work fine:
if ($uri =~ m{([^/]+)\z}{
print $1, $/;
}
Stripping the extension is a bit harder, because first you have to define what an extension is. If you just want to split off everything from the last period to the end, use a regex like this:
$filename =~ s/\.[^.]+\z//;
That will give you pkg-5.6.tar in the first example, which is technically correct, because you have a .tar file inside a .gz file. If you don't like that outcome, specify how the recognition of the extension should work. | [reply] [d/l] [select] |
|
|
hi
thanks alot!
Could you tell me the solution
filename input: pkg-5.6.tar.gz
to
filename output: pkg-5.6
Regular expressions are really hard for me ... I manage to read your solutions (more or less) but I'm not able to write my own yet :-(
| [reply] |
|
|
Well, you can special-case it, but then it'll only work for .tar.gz:
s{\.(?:tar\.gz|[^.]+)\z}{};
The problem is that, in general, you can't know from a file name which part of it is "extension" and which part is not, unless you either have a clear-cut definition of what "extension" means (I don't know any that satisfies your requirement), or you have a list of all possible extensions. | [reply] [d/l] |
|
|
|
|
|
|
|
|
Both my solutions (below) treat every dot-followed-by-things-that-are-word-chars-but-arent-just-numbers as extensions, so they'll get both your cases pkg-5.6.tar.gz and pkg-5.6-win32.zip right
| [reply] |
Re: Splitting an url in its components
by Your Mother (Archbishop) on Jul 17, 2008 at 17:46 UTC
|
What moritz said initially. URI is core and it's really simple and flexible once you get the hang of it. Solving problems with regexes in defined formats (URI, HTML, XML, whatever) is almost always a mistake.
use URI;
my @uri = qw(
http://foo.com/downloads/pkg-5.6.tar.gz
http://bar.com/list/file-5.6-win32.zip
);
for my $raw ( @uri )
{
my $uri = URI->new($raw);
printf("%s\n - %s\n - %s\n - %s\n\n",
$uri,
$uri->path,
( $uri->path_segments )[-1],
$uri->host,
);
}
--
http://foo.com/downloads/pkg-5.6.tar.gz
- /downloads/pkg-5.6.tar.gz
- pkg-5.6.tar.gz
- foo.com
http://bar.com/list/file-5.6-win32.zip
- /list/file-5.6-win32.zip
- file-5.6-win32.zip
- bar.com
| [reply] [d/l] |
Re: Splitting an url in its components
by scorpio17 (Canon) on Jul 17, 2008 at 13:23 UTC
|
use strict;
my @urls = qw(
http://foo.com/downloads/pkg-5.6.tar.gz
http://bar.com/list/file-5.6-win32.zip
);
for my $url (@urls) {
my @stuff = split(/\//, $url); # split on '/'
my $file = pop @stuff; # take item off the 'right' side
print "file = $file\n";
my @stuff2 = split(/\./, $file); # split on '.'
my $file2 = shift @stuff2; # take item of 'left' side
print "file2 = $file2\n";
}
| [reply] [d/l] |
Re: Splitting an url in its components
by leocharre (Priest) on Jul 17, 2008 at 14:02 UTC
|
This is an imperfect example of doing this via regexes.
#!/usr/bin/perl
use strict;
my @urls = qw(http://foo.com/downloads/pkg-5.6.tar.gz
http://bar.com/list/file-5.6-win32.zip
example.org/subd/golf.txt
http://digg.com/page4
http://www.cnn.com/2008/US/07/17/beck.che.guevara/index.html
);
for my $url (@urls){
print STDERR "URL: $url\n";
my $segment = url_segments($url);
for my $key ( sort keys %$segment ){
my $val = $segment->{$key};
print STDERR " - $key : $val\n";
}
print STDERR "\n";
}
exit;
sub url_segments {
my $url = shift;
my %segment = (
url => $url,
domain => undef,
location => undef,
location_relative => undef,
filename => undef,
filename_only => undef,
extension => undef,
query_string => undef,
protocol => undef,
);
if ( $url=~s/(\?.*)$// ) {# take out possible query string
$segment{query_string} = $1;
}
if ( $url=~s/^([a-z]{3,5})\:\/\///i ){
$segment{protocol} = $1;
}
if ( $url=~s/^([\w\.]+)// ){
$segment{domain} = $1;
}
if ( $url=~s/([^\/]+)\.([a-z0-9]{1,5})$// ){
$segment{filename} = "$1.$2";
$segment{filename_only} = $1;
$segment{extension} = $2;
}
if ( $url ){
$segment{location_relative} = $url;
$segment{location} = $segment{domain}.$segment{location_relative
+};
}
return \%segment;
}
# output:
URL: http://foo.com/downloads/pkg-5.6.tar.gz
- domain : foo.com
- extension : gz
- filename : pkg-5.6.tar.gz
- filename_only : pkg-5.6.tar
- location : foo.com/downloads/
- location_relative : /downloads/
- protocol : http
- query_string :
- url : http://foo.com/downloads/pkg-5.6.tar.gz
URL: http://bar.com/list/file-5.6-win32.zip
- domain : bar.com
- extension : zip
- filename : file-5.6-win32.zip
- filename_only : file-5.6-win32
- location : bar.com/list/
- location_relative : /list/
- protocol : http
- query_string :
- url : http://bar.com/list/file-5.6-win32.zip
URL: example.org/subd/golf.txt
- domain : example.org
- extension : txt
- filename : golf.txt
- filename_only : golf
- location : example.org/subd/
- location_relative : /subd/
- protocol :
- query_string :
- url : example.org/subd/golf.txt
URL: http://digg.com/page4
- domain : digg.com
- extension :
- filename :
- filename_only :
- location : digg.com/page4
- location_relative : /page4
- protocol : http
- query_string :
- url : http://digg.com/page4
URL: http://www.cnn.com/2008/US/07/17/beck.che.guevara/index.html
- domain : www.cnn.com
- extension : html
- filename : index.html
- filename_only : index
- location : www.cnn.com/2008/US/07/17/beck.che.guevara/
- location_relative : /2008/US/07/17/beck.che.guevara/
- protocol : http
- query_string :
- url : http://www.cnn.com/2008/US/07/17/beck.che.guevara/index.html
| [reply] [d/l] |
Re: Splitting an url in its components
by massa (Hermit) on Jul 17, 2008 at 15:47 UTC
|
If you _really_ want to reinvent the wheel, you can do with a single regex like:
use strict;
use warnings;
sub url_segments {
local($_) = @_;
my $segments = { url => $_ };
return unless m{
\A
(?: ([a-z]{3,6}) : (?://)? )? # proto
( [:a-z0-9@._-]+ ) # domain
(?:
( .*? / ) # location
(
( [^/]*? ) # filename_only
( (?: \. \w*[a-z]\w* )* ) # ext
)
(?: ( \? .* ) )? # query
)?
\Z
}x;
$segments->{proto} = $1 if $1;
$segments->{domain} = $2 if $2;
$segments->{location} = $3 if $3;
$segments->{filename} = $4 if $4;
$segments->{filename_only} = $5 if $5;
$segments->{ext} = $6 if $6;
$segments->{query} = $7 if $7;
$segments
}
use YAML;
print Dump( { map { $_ => url_segments $_ } qw(
http://foo.com/downloads/pkg-5.6.tar.gz
http://bar.com/list/file-5.6-win32.zip
example.org/subd/golf.txt
http://digg.com/page4
http://digg.com/page4?format=printable
http://www.cnn.com/2008/US/07/17/beck.che.guevara/index.html
https://tomwtriker:imzadi@coldmail.net/goethe.3.5.tar
mailto:tomwtriker@coldmail.net
) } )
Result:
---
example.org/subd/golf.txt:
domain: example.org
ext: .txt
filename: golf.txt
filename_only: golf
location: /subd/
url: example.org/subd/golf.txt
http://bar.com/list/file-5.6-win32.zip:
domain: bar.com
ext: .zip
filename: file-5.6-win32.zip
filename_only: file-5.6-win32
location: /list/
proto: http
url: http://bar.com/list/file-5.6-win32.zip
http://digg.com/page4:
domain: digg.com
filename: page4
filename_only: page4
location: /
proto: http
url: http://digg.com/page4
http://digg.com/page4?format=printable:
domain: digg.com
filename: page4
filename_only: page4
location: /
proto: http
query: '?format=printable'
url: http://digg.com/page4?format=printable
http://foo.com/downloads/pkg-5.6.tar.gz:
domain: foo.com
ext: .tar.gz
filename: pkg-5.6.tar.gz
filename_only: pkg-5.6
location: /downloads/
proto: http
url: http://foo.com/downloads/pkg-5.6.tar.gz
http://www.cnn.com/2008/US/07/17/beck.che.guevara/index.html:
domain: www.cnn.com
ext: .html
filename: index.html
filename_only: index
location: /2008/US/07/17/beck.che.guevara/
proto: http
url: http://www.cnn.com/2008/US/07/17/beck.che.guevara/index.html
https://tomwtriker:imzadi@coldmail.net/goethe.3.5.tar:
domain: tomwtriker:imzadi@coldmail.net
ext: .tar
filename: goethe.3.5.tar
filename_only: goethe.3.5
location: /
proto: https
url: https://tomwtriker:imzadi@coldmail.net/goethe.3.5.tar
mailto:tomwtriker@coldmail.net:
domain: tomwtriker@coldmail.net
proto: mailto
url: mailto:tomwtriker@coldmail.net
| [reply] [d/l] [select] |
|
|
(replying to self, duh) Here is a more complete regex:
sub url_segments {
local($_) = @_;
my $segments = { url => $_ };
my @matches = m{
\A
(?:
([a-z]{3,6}) # proto
:
(?: // )?
)?
(?:
(?:
([a-z0-9_-]+) # user
(?: : ([^@]+) ) @ # password
)?
( [a-z0-9._-]+ ) # domain
(?: : (\d+) )? # port
)
(?:
( .*? / ) # location
(
( [^/]*? ) # filename_only
( ( \. \w*[a-z]\w* )* ) # ext
)
(?: \? ( .* ) )? # query
)?
\Z
}xi;
return unless @matches;
my @parts = qw(proto user pass domain port location
filename filename_only ext query);
$matches[$_] and
$segments->{$parts[$_]} = $matches[$_] for 0 .. $#parts;
$segments
}
| [reply] [d/l] |
|
|
Wow, that's pretty freaking elegant.
I like this example more than the one after, because it's legible.
Nice YAML touch- Thank you for showing us!
| [reply] |
|
|
At your service... :-)
I like the first example better, too, but the second one gets more parts of the URI separated... and that's why I had to break it in so many lines! As for YAML, it's my favourite debugging tool (I think it's far more legible than Data::Dump(er)?)
| [reply] |
|
|
|
|