johnnywang has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm trying to extract the beginning part of a path-like string. The following script:

    use strict;
    my @input = ("A/B/C/D/E/F/", "A/B/C/D/", "A/B/C/D", "A/B/C/", "A/B/C", "A/B", "A");
    foreach (@input) {
        print "$1\n" if m|((?:\w+/){4})|;
    }

only matches the first two elements, and gives:
    A/B/C/D/
    A/B/C/D/

I'd like to have a regex that matches all elements in the input, i.e., match up to level 4, and also make the "/" at the end optional. Desired output:

    A/B/C/D/
    A/B/C/D/
    A/B/C/D
    A/B/C/
    A/B/C
    A/B
    A

What's the best way? Thanks.

Re: regex greedy range
by ikegami (Patriarch) on Sep 16, 2004 at 23:34 UTC

    Going from
        print "$1\n" if m|((?:\w+/){4})|;
    to
        print "$1\n" if m|((?:\w+/){0,4})|;
    gets you halfway there. Add a '?':
        print "$1\n" if m|((?:\w+/?){0,4})|;
    and there you are.

    '{4}' means match exactly 4 times, whereas '{0,4}' means match up to 4 times. The '?' makes the '/' optional.
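To see what each of the three steps above actually captures, here is a minimal sketch that runs them over the input from the question. The helper name `first_capture` and the step labels are made up for illustration; the patterns are the ones quoted in the reply.

```perl
use strict;
use warnings;

my @input = ("A/B/C/D/E/F/", "A/B/C/D/", "A/B/C/D", "A/B/C/", "A/B/C", "A/B", "A");

# The three patterns from the reply, in the order they were introduced.
my @steps = (
    [ '{4}',        qr|((?:\w+/){4})|    ],
    [ '{0,4}',      qr|((?:\w+/){0,4})|  ],
    [ '{0,4} + /?', qr|((?:\w+/?){0,4})| ],
);

# first_capture is a hypothetical helper: it returns $1, or undef if nothing matched.
sub first_capture {
    my ($re, $str) = @_;
    my ($m) = $str =~ $re;
    return $m;
}

for my $step (@steps) {
    my ($label, $re) = @$step;
    print "$label\n";
    for my $str (@input) {
        my $m = first_capture($re, $str);
        printf "  %-13s => %s\n", $str, defined $m ? "'$m'" : 'no match';
    }
}
```

With `{0,4}` alone every input matches, but "A/B/C/D" only yields 'A/B/C/' and "A" yields the empty string; adding the `/?` is what lets the last component be captured without its trailing slash.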

      Except now all parts of your regex are optional so it will match even the empty string. Also, quantified parens containing only quantified terms are a recipe for eventual disaster. I'd rather write this like so:

      m|(\w+(?:/\w+){1,3})|

      Makeshifts last the longest.
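The empty-string point can be checked directly. The sketch below compares the all-optional pattern with the one posted in this reply. Note that, as posted, `{1,3}` requires at least one slash, so a bare "A" does not match, and a trailing slash is never captured. The third pattern, `$variant`, is an assumption of mine and not something from the thread: it uses `{0,3}` plus a final `/?` to get back to the output the question asked for. The helper name `capture` is likewise hypothetical.

```perl
use strict;
use warnings;

my $all_optional = qr|((?:\w+/?){0,4})|;      # every part optional
my $posted       = qr|(\w+(?:/\w+){1,3})|;    # as posted in this reply
my $variant      = qr|(\w+(?:/\w+){0,3}/?)|;  # assumption: not from the thread

# capture is a hypothetical helper: it returns $1, or undef if nothing matched.
sub capture {
    my ($re, $str) = @_;
    my ($m) = $str =~ $re;
    return $m;
}

for my $str ("", "///", "A", "A/B/C/", "A/B/C/D/E/F/") {
    printf "%-15s", "'$str'";
    for my $re ($all_optional, $posted, $variant) {
        my $m = capture($re, $str);
        printf "  %-12s", defined $m ? "'$m'" : 'no match';
    }
    print "\n";
}
```

The all-optional pattern "succeeds" on "" and "///" with an empty capture, while the other two reject them because they insist on at least one `\w+`.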

Re: regex greedy range
by johnnywang (Priest) on Sep 16, 2004 at 23:50 UTC
    Thanks. The reason I titled it "greedy range" is that I thought {2,4} would stop as soon as it had matched 2 instances; I guess everything is greedy unless explicitly stated otherwise. Well, I should have just tried it, PM is making me lazier. Thanks.

      Correct, everything is greedy unless you add the '?'.
      greedy: a?, a*, a+, a{m,n}
      !greedy: a??, a*?, a+?, a{m,n}?
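A quick way to convince yourself is to run the same range quantifier with and without the trailing '?'. This is only a small sketch using the `{2,4}` range mentioned above on the longest input from the question; the variable names are made up.

```perl
use strict;
use warnings;

my $path = "A/B/C/D/E/F/";

# Greedy: takes as many repetitions as the range allows.
my ($greedy) = $path =~ m|((?:\w+/){2,4})|;

# Non-greedy: stops at the fewest repetitions that still let the match succeed.
my ($lazy) = $path =~ m|((?:\w+/){2,4}?)|;

print "greedy: $greedy\n";   # A/B/C/D/
print "lazy:   $lazy\n";     # A/B/
```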

Re: regex greedy range
by Aristotle (Chancellor) on Sep 16, 2004 at 23:19 UTC

    Well, if you want to capture all the parts, then, well, capture all the parts.

        my @input = qw( A/B/C/D/E/F/ A/B/C/D/ A/B/C/D A/B/C/ A/B/C A/B A );
        foreach (@input) {
            next if not m!((((\w+/)\w+/)\w+/)\w+/?)!;
            print "$_ has $4 $3 $2 $1\n";
        }

    Misread the question…

    Makeshifts last the longest.

      You forgot a bunch of question marks, and you're using capturing when you only need grouping:

      m!((((\w+/)\w+/)\w+/)\w+/?)!;
      should be:
      m!((?:(?:(?:\w+/)?\w+/)?\w+/)?\w+/?)!;
      but that requires lots of backtracking, so I think it's less efficient than:
      m!(\w+(?:/\w+(?:/\w+(?:/\w+)?)?)?)!;
      which only requires a single-character lookahead.
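The two nested-optional patterns above can be compared side by side on the question's input. One thing the sketch makes visible: the second pattern, as written, never consumes a trailing slash, so the two are not interchangeable for inputs such as "A/B/C/". The third pattern, `$suffix_slash`, simply appends `/?` to it; that addition is my assumption, not part of the thread. The names `$nested_prefix`, `$nested_suffix` and `capture` are hypothetical.

```perl
use strict;
use warnings;

my @input = qw( A/B/C/D/E/F/ A/B/C/D/ A/B/C/D A/B/C/ A/B/C A/B A );

my $nested_prefix = qr!((?:(?:(?:\w+/)?\w+/)?\w+/)?\w+/?)!;    # optional leading parts
my $nested_suffix = qr!(\w+(?:/\w+(?:/\w+(?:/\w+)?)?)?)!;      # optional trailing parts
my $suffix_slash  = qr!(\w+(?:/\w+(?:/\w+(?:/\w+)?)?)?/?)!;    # assumption: same, plus /?

# capture is a hypothetical helper: it returns $1, or undef if nothing matched.
sub capture {
    my ($re, $str) = @_;
    my ($m) = $str =~ $re;
    return $m;
}

printf "%-13s  %-10s  %-10s  %-10s\n", 'input', 'prefix', 'suffix', 'suffix+/?';
for my $str (@input) {
    printf "%-13s  %-10s  %-10s  %-10s\n",
        $str, map { capture($_, $str) } $nested_prefix, $nested_suffix, $suffix_slash;
}
```

On these seven inputs the first and third columns agree with the desired output from the question, while the middle column loses the final "/" wherever the input has one.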

        No, I didn't forget the question marks, and I used capturing parens on purpose. But I was answering a different question than was actually asked.

        It could actually turn out more efficient with a slight variation:

        m!((?>(?>(?>\w+/)?\w+/)?\w+/)?\w+/?)!;

        I haven't done any benchmarks though.
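Before benchmarking, it is worth checking that the `(?>...)` variant still captures the same text, because an atomic group cannot be re-entered once it has matched; the only alternative left to the engine is to skip the optional group entirely. The sketch below prints both captures for the question's input and, only when run with the made-up argument `bench`, times the two patterns with the core Benchmark module. The names `$plain`, `$atomic` and `capture` are hypothetical.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @input = qw( A/B/C/D/E/F/ A/B/C/D/ A/B/C/D A/B/C/ A/B/C A/B A );

my $plain  = qr!((?:(?:(?:\w+/)?\w+/)?\w+/)?\w+/?)!;   # non-capturing groups
my $atomic = qr!((?>(?>(?>\w+/)?\w+/)?\w+/)?\w+/?)!;   # atomic groups

# capture is a hypothetical helper: it returns $1, or undef if nothing matched.
sub capture {
    my ($re, $str) = @_;
    my ($m) = $str =~ $re;
    return $m;
}

for my $str (@input) {
    printf "%-13s plain='%s' atomic='%s'\n",
        $str, capture($plain, $str), capture($atomic, $str);
}

# Timing comparison; runs only when the script is called with the argument "bench".
if (@ARGV && $ARGV[0] eq 'bench') {
    cmpthese(-1, {
        plain  => sub { for (@input) { my ($m) = /$plain/ } },
        atomic => sub { for (@input) { my ($m) = /$atomic/ } },
    });
}
```

On this input the two patterns disagree in two places: for "A/B/C/" the atomic version captures 'A/' and for "A/B/C" it captures 'A/B/', whereas the non-atomic version returns the whole string in both cases. So any speed gain would come with a change in behaviour.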

        Makeshifts last the longest.