cajun has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to capture some information from some URLs so that I can store the downloaded data into directories that make some sense.

The URLs I'm downloading the data from look like this:

http://www.domain.com/data/2005/sales/01012005.txt
http://www.domain.com/data/2005/sales-jan/01232005.txt
http://www.domain.com/data/2005/sales-local/01012005.txt
http://www.domain.com/data/2005/sales-outside-jan/01012005.txt
...
...
What I want to extract from this is:

sales
sales-jan
sales-local
sales-outside-jan
...
...

The regex I have come up with to extract the information is:

$dir = $1 if /\/(\w+(|-\w+|-\w+-\w+))\/\w+\.txt$/;

This regex appears to be working correctly. My question is: am I going about it the right way? Could I have shortened the regex somehow?

Thanks,
Mike

Update: Corrected typo in regex that GrandFather found.

Thanks GrandFather and ikegami for the suggestions. Yes, I should have used a different delimiter, the leaning toothpicks are confusing. I understand GrandFather's suggestion, but I'll have to study ikegami's suggestion a bit. Thanks!

Update II: Thanks to all for the great responses / ideas. Thanks to davidrw & YuckFoo for their suggestions on the split. Frankly I hadn't even thought of that. I became so wrapped up in the regex to get the directory, I hadn't even thought about the filename yet. Clearly a case of not seeing the forest for the trees.

Replies are listed 'Best First'.
Re: Regex question
by ikegami (Patriarch) on Aug 19, 2005 at 01:58 UTC

    You're being too specific. Your goal is simply to get the last thing between two slashes.

    my ($dir) = m{/([^/]+)/[^/]*$};

    Notice I changed the regexp's delimiters so I didn't have to escape the slashes.

    I removed the if. That will set $dir to undef on failure. Add the if back if you want to keep $dir's previous value on failure.
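A minimal sketch of the difference between the two forms, in case it helps. The sub names, the sample strings, and the 'previous' placeholder value are made up for illustration; only the pattern comes from the reply above.

```perl
use strict;
use warnings;

# The pattern from the reply: the last thing between two slashes.
my $pattern = qr{/([^/]+)/[^/]*$};

# List assignment: the result is undef when the match fails.
sub dir_list_form {
    my ($url) = @_;
    my ($dir) = $url =~ $pattern;
    return $dir;
}

# Conditional assignment: $dir keeps its earlier value when the match fails.
sub dir_if_form {
    my ($url, $dir) = @_;    # $dir carries the "previous" value
    $dir = $1 if $url =~ $pattern;
    return $dir;
}

my $good = 'http://www.domain.com/data/2005/sales-jan/01232005.txt';
my $bad  = 'no-slashes-here';    # hypothetical input that cannot match

for my $url ($good, $bad) {
    my $list = dir_list_form($url);
    my $cond = dir_if_form($url, 'previous');
    print "list form: ", (defined $list ? $list : 'undef'),
          ", if form: $cond\n";
}
```

Which form is preferable probably depends on whether a failed match should be detectable afterwards (undef) or silently ignored (old value kept).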

    Technically, you could omit the leading slash from the regexp, but I think it'll be more efficient with it.

    Update: Since you mentioned you wanted to study my regexp further, what follows might help. Read the comments from the bottom up.

    my ($dir) = m{
        /          # Preceded by a slash.
        ([^/]+)    # Preceded by non-slashes, the dir. Captured.
        /          # Preceded by a slash.
        [^/]*      # Preceded by non-slashes, the file name.
        $          # End of string.
    }x;

    Update: Oops! I forgot the parens around $dir. Added.

      Could also be done
      m{.*/(.*)/}

      Caution: Contents may have been coded under pressure.

        Or, to mirror the constraints in the OP's regex:

        m{^.* / ([\w-]+) / .*\.txt}x

        Also, the form:

        my $dir; $dir = $1 if ##regex_here##

        is a good idiom to become familiar with.

        Updates:

        • 2005-08.Aug-22 : fixed short-sighted error tlm points out below. I keep getting bit on that. ;-)

        <-radiant.matrix->
        Larry Wall is Yoda: there is no try{} (ok, except in Perl6; way to ruin a joke, Larry! ;P)
        The Code that can be seen is not the true Code
        "In any sufficiently large group of people, most are idiots" - Kaa's Law
Re: Regex question
by davidrw (Prior) on Aug 19, 2005 at 02:25 UTC
    Besides the regex solutions above, if ikegami's assumption is right that you just want the last thing between two slashes, you can also use split instead of a direct regex:
    $s = "http://blah/foo/stuff/more.txt";
    print +(split('/', $s))[-2];    # stuff
Re: Regex question
by YuckFoo (Abbot) on Aug 19, 2005 at 02:35 UTC
    cajun,

    Seems like a natural for split. You get the bonus filename too.

    SplitFoo

    #!/usr/bin/perl
    use strict;

    while (my $line = <DATA>) {
        chomp $line;
        my ($dir, $file) = (split(m{/}, $line))[-2, -1];
        print "$dir $file\n";
    }

    __DATA__
    http://www.domain.com/data/2005/sales/01012005.txt
    http://www.domain.com/data/2005/sales-jan/01232005.txt
    http://www.domain.com/data/2005/sales-local/01012005.txt
    http://www.domain.com/data/2005/sales-outside-jan/01012005.txt
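Since the original goal was to store each download in a sensible directory, the split result could feed straight into a local path. The following is only a sketch: the local_path helper and the 'downloads' base directory are hypothetical, and File::Spec::Unix is used so the separator is always a forward slash regardless of platform.

```perl
use strict;
use warnings;
use File::Spec::Unix;

# Hypothetical helper: map a URL to a relative local path under $base,
# using the last directory and the filename from the URL.
sub local_path {
    my ($base, $url) = @_;
    my ($dir, $file) = (split(m{/}, $url))[-2, -1];
    return File::Spec::Unix->catfile($base, $dir, $file);
}

# 'downloads' is an assumed base directory, not something from the thread.
print local_path('downloads',
    'http://www.domain.com/data/2005/sales-local/01012005.txt'), "\n";
```

On a real system File::Spec->catfile (without ::Unix) would pick the native separator instead.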
Re: Regex question
by GrandFather (Saint) on Aug 19, 2005 at 01:57 UTC

    Slightly improved, but not exactly the same match: m{/(\w+(-\w+)*)/\w+\.txt$}. Note that the . is quoted so it matches a ., not any character. Note also that -\w+ is allowed any number of times (including 0).
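To see where "not exactly the same match" shows up, here is a small comparison sketch. The capture_dir helper and the URL with three hyphenated parts (sales-outside-jan-west) are invented for illustration; the two patterns are the OP's and the one above, both written with m{} style delimiters.

```perl
use strict;
use warnings;

# The OP's pattern: at most two "-word" suffixes after the first word.
my $op_re = qr{/(\w+(|-\w+|-\w+-\w+))/\w+\.txt$};

# The shortened pattern: any number of "-word" suffixes, including zero.
my $gf_re = qr{/(\w+(-\w+)*)/\w+\.txt$};

# Hypothetical helper: return the captured directory, or undef on no match.
sub capture_dir {
    my ($re, $url) = @_;
    return $url =~ $re ? $1 : undef;
}

my @urls = (
    'http://www.domain.com/data/2005/sales-outside-jan/01012005.txt',
    # Made-up URL with three suffixes, to show where the patterns differ.
    'http://www.domain.com/data/2005/sales-outside-jan-west/01012005.txt',
);

for my $url (@urls) {
    my $op = capture_dir($op_re, $url);
    my $gf = capture_dir($gf_re, $url);
    print "OP: ", (defined $op ? $op : 'no match'),
          " | shortened: ", (defined $gf ? $gf : 'no match'), "\n";
}
```

For the four kinds of URL in the question both patterns capture the same directory; they only diverge once a directory has more than two hyphenated suffixes.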


    Perl is Huffman encoded by design.