xipho has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,
While I can loop through a string and solve this problem, I'm wondering if there is a regex that can do the same (I'm sure there is, likely using {}).

Given: 1) a string like $foo below; and 2) an arbitratry integer n which is < the count of some character $c (below $c = 'x') in $foo. Find, starting from the begining of $foo (^), the longest shortest string containing exactly n $c, (which will obviously be bound on both sides by $c).
$foo = "xxx.x.....x......xxx...xx...x...xxx"; sub bar { my $n = shift; # do regex stuff on $foo ... return $1 # or some such } # expected results for # n = # 1 - x # 2 - xx # 3 - xxx # 4 - xxx.x # 5 - xxx.x.....x # 6 - xxx.x.....x......x # ... etc.

thanks, x

update I do mean shortest! and result for 4 fixed. My apologies.

Replies are listed 'Best First'.
Re: regex finding longest string containing n of $c
by kvale (Monsignor) on Sep 01, 2005 at 03:51 UTC
    It is not 'obvious' why the string should be bound by x's on both sides if you want the longest string. For n x's, the longest strings would seem to be
    n = 1 - x 2 - xx 3 - xxx. 4 - xxx.x.... 5 - xxx.x.....x......
    in other words, don't forget the trailing non-x caharacters to lengthen the string.

    A regex to capture these strings is

    my $regex = '\A' . ('[^x]*x' x $n) . '[^x]*'; $foo =~ m{ ($regex) }xms; # PBP orthodoxy :)
    Update: With the problem redefined, the new regex should be retooled to use parsimonious quantifiers to find the shortest strings:
    my $regex = '\A' . ('[^x]*?x' x $n);
    Here I have also left off the last piece of the regex, as it is a noop.

    -Mark

      That will work (with a little modification for the revamped question too). Hadn't thought of just repeating the search letter with * between- simple now that I see it. Thanks!

        A regex is not going to do it for you because it will find the first match regardless of the length of the match. If you need to find the shortest (or longest) match then you need to find all the matches and select the best. A regex on it's own generally doesn't do that.


        Perl is Huffman encoded by design.
Re: regex finding shortest string containing n of $c
by inman (Curate) on Sep 01, 2005 at 08:34 UTC
    As mentioned earlier, the question is ambiguous and so is subject to interpretation. (It also smells of homework but its interesting so I will let it pass.)

    The question could be interpreted as follows:

    1. Anchor the pattern at the beginning of the string in which case kvale has provided the answer earlier.
    2. Start the matching process at the beginning of the string and find any part of the string that is bounded by x that such that it contains the desired number (n) of x's.

    The second alternative is more difficult than it appears. Firstly, there is a special case where n is 1 or 2. Secondly, the first x to be matched can appear within a match that has already taken place.

    In order to demonstrate this problem I have added an extra x to the beginning of the data and I am looking for 4 x's. The shortest string containing four x's is xxx.x but the first match that the regex finds lands in the middle of this and so the next match misses the shortest string.

    my $foo = "x.....xxx.x.....x......xxx...xx...x...xxx"; my @array = $foo =~ /( # capture in $1 [x] # match an opening x (?:[^x]*?[x][^x]*?) # match an x surrounded by some non-x {2} # match n-2 times [x] # match a closing x )/gx; #(?{push @array, $1}) # keep the match #$ # match the end of the string to forc +e back-tracking print "$_\n" foreach (sort {length $a <=> length $b} @array); __END__ x.....xxx x...xx...x x.....x......xx

    In order to resolve this situation, you can use backtracking in the regex. The following regex embeds code to store the matches. In this way, all of the possible matches are found (albeit a number of times).

    my $foo = "x.....xxx.x.....x......xxx...xx...x...xxx"; my @array; $foo =~ /( # capture in $1 [x] # match an opening x (?:[^x]*?[x][^x]*?) # match an x surrounded by some non-x {2} # match n-2 times [x] # match a closing x ) (?{push @array, $1}) # keep the match $ # match the end of the string to force + back-tracking /x; print "$_\n" foreach (sort {length $a <=> length $b} @array); __END__ xxx.x xxx...x xx...xx xx...xx xx...xx xx...xx x...xxx x.....xxx xx.x.....x xx.x.....x x......xxx x...xx...x xx...x...x xx...x...x xx...x...x xx...x...x x...x...xx x...x...xx x...x...xx x...x...xx x.....x......xx x.....x......xx x.....x......xx x.....x......xx x.....x......xx x.....x......xx x.....x......xx x.x.....x......x x.x.....x......x x.x.....x......x x.x.....x......x x.x.....x......x x.x.....x......x
      Well, I am working at home, but this is definitely not homework (see the only other node I've written). I'm going to have to put a disclaimer in my sig I think. Thanks for the discussion- a nice refence for future work.
Re: regex finding longest string containing n of $c
by Errto (Vicar) on Sep 01, 2005 at 03:40 UTC
    Based on your description of the problem and sample results, it looks like you mean shortest, not longest. Otherwise the answers are more like
    1 - .....x...... 2 - .x.....x...... 3 - .x.....x......x 4 - x.x.....x......x
      Your right- my apologies, I am looking for shortest!
Re: regex finding longest string containing n of $c
by GrandFather (Saint) on Sep 01, 2005 at 03:42 UTC

    Your expected results do not match your description - I would have expected line 4 to contain 4 x's.

    It is not obvious that the longest string will be bounded by $c. Or do you mean that the longest string must be terminated at each end by $c such that, if $n == 1, then ".....x......" is not acceptable, but 'x' is?

    I doubt that using a regex is the most efficient way to solve this. Is it a requirement?


    Perl is Huffman encoded by design.
      I did mess up- meant shortest (see change in title). I think this then implies the bounding I mentioned? I don't need a regex- was just curious if such was possible. Thanks for the help.

        You should change your question too. Best to <strike> the incorrect text (so others can see what the fuss was about) and add the correct text.


        Perl is Huffman encoded by design.
Re: regex finding shortest string containing n of $c
by japhy (Canon) on Sep 01, 2005 at 12:02 UTC
    This would've been more interesting if it weren't confined to the start of the string. And by "interesting", I mean it would have presented an opportunity to showcase forced backtracking in a regex among some of the other powerful things regexes can do.

    Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
    How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart