Regex question

cajun has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to capture some information from some URLs so that I can store the downloaded data into directories that make some sense.

The URLs I'm downloading the data from look like this:

http://www.domain.com/data/2005/sales/01012005.txt
http://www.domain.com/data/2005/sales-jan/01232005.txt
http://www.domain.com/data/2005/sales-local/01012005.txt
http://www.domain.com/data/2005/sales-outside-jan/01012005.txt
...
...

What I want to extract from this is:

sales
sales-jan
sales-local
sales-outside-jan
...
...

The regex I have come up with to extract the information is:

$dir = $1 if /\/(\w+(|-\w+|-\w+-\w+))\/\w+\.txt$/;

This regex appears to be working correctly. My question is: am I going about it the right way? Could I have shortened the regex somehow?

Thanks,
Mike

Update: Corrected typo in regex that GrandFather found.

Thanks GrandFather and ikegami for the suggestions. Yes, I should have used a different delimiter, the leaning toothpicks are confusing. I understand GrandFather's suggestion, but I'll have to study ikegami's suggestion a bit. Thanks!

Update II: Thanks to all for the great responses / ideas. Thanks to davidrw & YuckFoo for their suggestions on the split. Frankly I hadn't even thought of that. I became so wrapped up in the regex to get the directory, I hadn't even thought about the filename yet. Clearly a case of not seeing the forest for the trees.


Replies are listed 'Best First'.
Re: Regex question by ikegami (Patriarch) on Aug 19, 2005 at 01:58 UTC
You're being too specific. Your goal is simply to get the last thing between two slashes.

    my ($dir) = m{/([^/]+)/[^/]*$};

Notice I changed the regexp's delimiters so I didn't have to escape the slashes. I removed the `if`, so `$dir` is set to undef on failure. Add the `if` back if you want to keep `$dir`'s previous value on failure. Technically, you could omit the leading slash from the regexp, but I think it'll be more efficient with it.

Update: Since you mentioned you wanted to study my regexp further, what follows might help. Read the comments from the bottom up.

    my ($dir) = m{
        /          # Preceded by a slash.
        ([^/]+)    # Preceded by non-slashes, the dir. Captured.
        /          # Preceded by a slash.
        [^/]*      # Preceded by non-slashes, the file name.
        $          # End of string.
    }x;

Update: Oops! I forgot the parens around `$dir`. Added.
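To make the difference between the list-assignment form and the `if` form concrete, here is a minimal sketch. The sub names, the `unknown` default, and the non-matching test string are made up for illustration, and the file-name part of the pattern is assumed to be `[^/]*`.

```perl
use strict;
use warnings;

# List-context form: returns undef when the pattern does not match.
sub dir_or_undef {
    my ($url) = @_;
    my ($dir) = $url =~ m{/([^/]+)/[^/]*$};
    return $dir;
}

# "if" form: the previous value (a made-up default here) survives a failed match.
sub dir_or_default {
    my ($url, $default) = @_;
    my $dir = $default;
    $dir = $1 if $url =~ m{/([^/]+)/[^/]*$};
    return $dir;
}

for my $url ('http://www.domain.com/data/2005/sales-jan/01232005.txt', 'no-slashes-here') {
    my $dir = dir_or_undef($url);
    printf "%s | %s\n", defined $dir ? $dir : 'undef', dir_or_default($url, 'unknown');
}
```

For the first URL both forms yield `sales-jan`; for the string without slashes the first form yields undef while the second keeps `unknown`.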
Re^2: Regex question by Roy Johnson (Monsignor) on Aug 19, 2005 at 03:12 UTC
Could also be done with `m{.*/(.*)/}`.

Caution: Contents may have been coded under pressure.
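A pattern this short relies on greedy matching and backtracking. The sketch below assumes the intended pattern is `m{.*/(.*)/}` (the quantifiers are this example's reading of the post), and the sub name `last_dir_greedy` is hypothetical.

```perl
use strict;
use warnings;

# Assumed reading of the short pattern: a greedy .* on both sides.
sub last_dir_greedy {
    my ($url) = @_;
    # The first .* grabs as much as it can, then backs off just far enough
    # that "/", a captured chunk, and one more "/" can still be matched.
    # The capture therefore ends up being the text between the last two slashes.
    my ($dir) = $url =~ m{.*/(.*)/};
    return $dir;
}

print last_dir_greedy('http://www.domain.com/data/2005/sales-outside-jan/01012005.txt'), "\n";
```

Because nothing anchors the pattern to the end of the string or to a `.txt` suffix, it accepts more inputs than the other suggestions, which may or may not be what you want.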
Re^3: Regex question by radiantmatrix (Parson) on Aug 19, 2005 at 16:07 UTC
Or, to mirror the constraints in the OP's regex:

    m{^.* / ([\w-]+) / .*\.txt}x

Also, the form:

    my $dir;
    $dir = $1 if ##regex_here##

is a good idiom to become familiar with.

Updates:
* 2005-08.Aug-22 : fixed short-sighted error tlm points out below. I keep getting bit on that. ;-)

<-radiant.matrix->
Larry Wall is Yoda: there is no `try{}` (ok, except in Perl6; way to ruin a joke, Larry! ;P)
The Code that can be seen is not the true Code
"In any sufficiently large group of people, most are idiots" - Kaa's Law
Re^4: Regex question by tlm (Prior) on Aug 19, 2005 at 17:45 UTC
Re: Regex question by davidrw (Prior) on Aug 19, 2005 at 02:25 UTC
Besides the regex solutions above, if ikegami's assumption is right that you just want the last thing between two slashes, you can also use `split` instead of a direct regex:

    $s = "http://blah/foo/stuff/more.txt";
    print +(split('/', $s))[-2];   # stuff
Re: Regex question by YuckFoo (Abbot) on Aug 19, 2005 at 02:35 UTC
cajun,

Seems like a natural for split. You get the bonus filename too.

SplitFoo

    #!/usr/bin/perl
    use strict;

    while (my $line = <DATA>) {
        chomp $line;
        my ($dir, $file) = (split(m{/}, $line))[-2, -1];
        print "$dir $file\n";
    }

    __DATA__
    http://www.domain.com/data/2005/sales/01012005.txt
    http://www.domain.com/data/2005/sales-jan/01232005.txt
    http://www.domain.com/data/2005/sales-local/01012005.txt
    http://www.domain.com/data/2005/sales-outside-jan/01012005.txt
Re: Regex question by GrandFather (Saint) on Aug 19, 2005 at 01:57 UTC
Slightly improved, but not exactly the same match: `m{/(\w+(-\w+)*)/\w+\.txt$}`. Note that the . is escaped so it matches a literal ., not any character. Note also that -\w+ is allowed any number of times (including 0).

Perl is Huffman encoded by design.
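Where the two patterns differ is easiest to see on a directory name with more than two hyphenated parts. This is a small sketch: the `capture` helper and the three-suffix URL are invented for illustration and do not come from the original data.

```perl
use strict;
use warnings;

# The pattern from the question: zero, one, or two "-word" suffixes.
my $original  = qr{/(\w+(|-\w+|-\w+-\w+))/\w+\.txt$};
# The suggested pattern: any number of "-word" suffixes.
my $suggested = qr{/(\w+(-\w+)*)/\w+\.txt$};

# Hypothetical helper: returns the captured directory, or undef on no match.
sub capture {
    my ($re, $url) = @_;
    return $url =~ $re ? $1 : undef;
}

# Made-up URL whose directory has three suffixes.
my $url = 'http://www.domain.com/data/2005/sales-outside-jan-old/01012005.txt';
for my $pair (['original', $original], ['suggested', $suggested]) {
    my $dir = capture($pair->[1], $url);
    printf "%-9s %s\n", $pair->[0], defined $dir ? $dir : 'no match';
}
```

The original pattern reports no match for this URL, while the suggested one captures `sales-outside-jan-old`. On the four URLs from the question both patterns capture the same directory.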