gri6507 has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

Given a string of type BABABBB and a pattern of AB, I would like to split the string into 3 sections: first=B, middle=ABAB, last=BB. I have the following pertinent code:

use strict;
my $string;
read(DATA,$string,7);
print "$string\n";
my $pattern = "AB";
print "pattern is $pattern\n";
my ($start,$middle,$end) = $string =~ /^(.*?)($pattern+)(.*?)$/g;
print "splitting\n";
print "start = $start\n";   #gets B
print "middle = $middle\n"; #gets AB
print "end = $end\n";       #gets ABBB
__DATA__
BABABBB

What is wrong with this regex? Please help. Thanks.

Replies are listed 'Best First'.
Re: Regex help
by jeffa (Bishop) on Aug 24, 2003 at 15:31 UTC
    Not really sure what the point of this is, but if you add some more parens you should get what you want:
    my ($start,$middle,$end) = $string =~ /^(.*?)(($pattern)+)(.*?)$/g;
    By placing the + outside of the first "parened" $pattern, you allow more than one - then, put some parens around that to capture the results to $2.

    Hope this helps, :)

    UPDATE:
    Oops, almost got that right ... now we are trying to match 4 items, not 3 anymore ... so try this:
    my ($start,$middle,undef,$end) = $string =~ ...
    UPDATE 2:
    I like liz's and CombatSquirrel's suggestion to use a non-capturing paren group ... but none of these solutions (yes bart, including mine ;)) are robust. For example, none will work with the string BABABABBB ...

    UPDATE 3:
    split! of course! i like it!

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
      Errm, jeffa, that would screw the my ($start,$middle,$end) part up.
      Suggested alternatives:
      # first one
      my ($start, $middle, undef, $end) = $string =~ /^(.*?)(($pattern)+)(.*?)$/g;
      # second one, with non-capturing parens -- I like it better
      my ($start,$middle,$end) = $string =~ /^(.*?)((?:$pattern)+)(.*?)$/g;
      gri6507, note that /$pattern+/ is exactly the same as /AB+/ (well, at least for $pattern = 'AB'), which is, by definition, /A(?:B)+/.
      Hope this helped.
      CombatSquirrel.

      Update: Arghh - wrong order for capturing and non-capturing parens. Fixed.

      Update 2: jeffa is right. The following RegEx should do the trick:
      $pattern = 'AB';
      'BABABABBB' =~ /
          ^                     # start at beginning of line
          (                     # capture to $1
              .*?               # a number of characters, but as few as possible ...
              (?<!$pattern)     # ... which may not contain $pattern
          )
          (                     # capture to $2
              (?:$pattern)+     # multiple occurrences of $pattern
            |                   # OR
              (?!.*?$pattern)   # nothing, BUT there may be no $pattern in the rest
                                # of the string
          )
          (.*)                  # capture rest to $3
      /x;
      print "$1<$2>$3$/";
      __END__
      prints "B<ABABAB>BB"
      I'm open for any suggestions, and yes, I do know Mastering Regular Expressions, I just forgot half (the important half) of it.

      Update 3 (Explanation): The RegEx engine tries to match at the earliest possible position. Therefore it will always match nothing to be captured in $1 (non-greedy dot-star), the highest possible number of following pattern matches (greedy star) and then the rest. Meaning, if the first pattern does not begin at the first character, $2 will also be empty (after all a star does not have to match) and the rest is slurped into $3. Bon appetit!

      You can avoid the extraneous capture by using non-capturing parens.

      my( $start, $middle, $end ) = $string =~ m[^ (.*?) ( (?:AB)+ ) (.*?) $ ]xg;

      ...but you know that:)


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
      If I understand your problem, I can solve it! Of course, the same can be said for you.

      Thank you, that does exactly what I need.
Re: Regex help
by liz (Monsignor) on Aug 24, 2003 at 15:35 UTC
    There are probably better ways to do this, but this is my take:
    my ($start,$middle,$end) = $string =~ /^(.*?)((?:$pattern)+)(.*?)$/g;
    The problem with your version was that the + in the second container was just +ing the B, so you need to group around the string "AB". But then you get only 1 AB!

    Since you want to have all AB's, you need to capture that whole thing again. So there are grouping parentheses around that again. And to not change the order of the captured strings, the inner one has ?: which indicates that it's just a grouping and not a capture.
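The difference between the two forms can be seen side by side. This is only a minimal sketch using the string and pattern from the question; the array names @plain and @grouped are made up for illustration.

```perl
use strict;
use warnings;

my $string  = 'BABABBB';
my $pattern = 'AB';

# Original form: the + binds only to the last character, so this is /AB+/.
my @plain = $string =~ /^(.*?)($pattern+)(.*?)$/;

# Grouped without capturing: the + now repeats the whole "AB",
# and the outer parens capture the complete run into $2.
my @grouped = $string =~ /^(.*?)((?:$pattern)+)(.*?)$/;

print "plain:   @plain\n";
print "grouped: @grouped\n";
```

The first line of output should reproduce the B / AB / ABBB result reported in the question, and the second the desired B / ABAB / BB.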

    Hope this helps.

    Liz

Re: Regex help
by BrowserUk (Patriarch) on Aug 24, 2003 at 15:49 UTC

    Another way to do this would be with split.

    my( $start, $middle, $end ) = split /((?:AB)+)/,$string ;

    Much of a muchness in this case, but it does show the little-used technique of using capturing brackets with split to retain the bits that would otherwise be discarded, which is sometimes useful.



      No, you really should have done this:
      my($start, $middle, $end) = split /((?:AB)+)/, $string, 2;
      Try something that contains this pattern twice, and you'll immediately see the difference, as in
      $string = "xAByABz";
      Yours would have put just "y" into $end, mine takes the entire rest of the string, "yABz".
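A small sketch of that difference, assuming the same "xAByABz" test string; the array names are hypothetical and the pattern is hard-coded as in the posts above.

```perl
use strict;
use warnings;

my $string = 'xAByABz';

# No limit: split at every run of "AB", keeping each captured separator.
my @all = split /((?:AB)+)/, $string;

# Limit of 2: split only at the first run; the rest stays in one piece.
# Captured separators do not count toward the limit.
my @first = split /((?:AB)+)/, $string, 2;

print join('|', @all),   "\n";
print join('|', @first), "\n";
```

With three scalars on the left-hand side, the unlimited form hands "y" to the third variable, while the limited form hands it "yABz".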
Re: Regex help
by davido (Cardinal) on Aug 24, 2003 at 16:48 UTC
    There are a number of replies to your question, proposing a variety of solutions to your problem, while sticking with variations on your original regular expression that attempt to match everything, capturing different parts of the match with capturing parentheses.

    Eventually someone will hit on the right technique; one that isn't plagued by lazy regexp engines, greedy matching, etc. But there's another possibility...

    You could make it easier on yourself, not worrying about trying to match ^(.*?) nongreedily, or about the lazy engine, or about (.*?)$ slurping everything up. Do it like this:

    my $pattern = "AB";
    print "pattern is $pattern\n";
    my ( $middle ) = $string =~ /((?:$pattern)+)/;
    my ( $start, $end ) = ( $`, $' );
    #.... and so on....

    You take a performance hit in all regexp's in the program for using $` and $', but as I understand it, introducing capturing parens also introduces a similar performance hit for the current regular expression. And in non-time-critical operations (anything outside of tight loops) you don't really need to worry about the performance anyway, right? ...so just do it the easy way.

    If it turns out that you can't live with the speed-efficiency hit taken by leaning toward programming-efficiency, you can dig into other solutions. But the fact is that $`, $', and $& are there to be used, as long as you understand the ramifications of their use. To my knowledge, their use isn't deprecated, and it would seem that newer releases of Perl have even taken steps to make the use of those special variables more speed-efficiency friendly.
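For anyone who wants the same three pieces without touching $` and $', one possible alternative (not proposed in this thread) is to slice the string using the match offsets in the built-in @- and @+ arrays. This is only a sketch, assuming the string and pattern from the question:

```perl
use strict;
use warnings;

my $string  = 'BABABBB';
my $pattern = 'AB';

my ($start, $middle, $end);
if ($string =~ /(?:$pattern)+/) {
    # $-[0] and $+[0] are the start and end offsets of the whole match.
    $start  = substr($string, 0, $-[0]);
    $middle = substr($string, $-[0], $+[0] - $-[0]);
    $end    = substr($string, $+[0]);
    print "start = $start\nmiddle = $middle\nend = $end\n";
}
```

It is a little more typing than $` and $', which is the programmer-efficiency trade-off discussed above.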

    When the solution becomes so tricky that a dozen followup posts are still debating how to accomplish it, I think it's time to implement Perl's credo: There is more than one way to do it. (Start looking for a simpler solution). To that end, give my example a try.

    Hope this helps...

    Dave

    "If I had my life to do over again, I'd be a plumber." -- Albert Einstein

      You are correct, but the performance hit associated with $` and $' affects every regex in your script, not just this one you want to use it for.

      This is an old problem, and the reason why use of $` and $' is frowned upon for larger scripts. Though I'm almost sure that the perl5porters will find ways to minimize this problem over time.

        It is true that the time-performance hit for using $` and $' persists through every regexp in the program, because if those special variables are used just once, Perl makes the decision that all regexp's in the program should now use those special variables.

        The performance hit will be among all regexp's in the program, including those that don't use either those special variables, or capturing parentheses.

        In the Camel book, one item under "Time Efficiency" is not to use $`, $&, and $'.

        However, one item under "Programmer Efficiency" is to use $`, $&, and $'.

        To me that says, weigh the time vs. programming simplicity paradox, and choose whichever one you feel is the best for your situation. The OP's code section was brief. Solving it using non-greedy matches, non-capturing and capturing parens, and a slightly-tricky regexp proved to be the topic of a dozen or so post replies in the thread. That tells me that the solutions that followed in the spirit of the OP's methodology were all too complex for the simple problem trying to be solved. That led me to decide, why not take the simpler, less time efficient, but much more programming efficient approach.

        It would be wrong to say that the use of $`, $&, and $' is deprecated. Their use is clearly not. It just comes with a caveat: Use them but understand that they will cause a time performance issue with regexp's in your program. It is probably safe to say that at some point that will become less of an issue, as Perl continues to grow and develop. And clearly Perl's designers intend to keep those special variables, not just for backward compatibility, but for their continued use. 5.8.0, for example, has found a way to minimize the impact of $&. I wouldn't be surprised to see the impact of $` and $' get improved upon in the future, though I can't claim to know what's going on in the minds of Perl's developers.

        Anyway, sorry to get longwinded. I just wanted to explain that it is ok to make a conscious decision to use one method over another, as long as you understand the ramifications of each method.

        Dave

        "If I had my life to do over again, I'd be a plumber." -- Albert Einstein

Re: Regex help
by bart (Canon) on Aug 24, 2003 at 16:46 UTC
    If you had just used qr, you wouldn't have had this problem. Plus, Perl would have checked for you whether $pattern actually contains a valid regex on its own, and not just tested the larger pattern.
    my $pattern = qr/AB/;
    or
    my $pattern = "AB"; $pattern = qr/$pattern/;
    It also incorporates regex switches into the regex, so you can't globally override them. For example, if $pattern looks like "A B", if you use it in /$pattern/x, the space would be stripped from the subpattern. But not with qr.
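Both points can be checked with a short script. This is a sketch only, reusing the string from the question; the names $as_string and $as_qr are invented for the example.

```perl
use strict;
use warnings;

my $string = 'BABABBB';

# qr// wraps the pattern in its own group, so + repeats all of "AB".
my $pattern = qr/AB/;
my ($start, $middle, $end) = $string =~ /^(.*?)($pattern+)(.*?)$/;
print "start = $start, middle = $middle, end = $end\n";

# A plain string loses its space under /x; a qr object keeps it.
my $as_string = 'A B';
my $as_qr     = qr/A B/;
print "string under /x matches 'AB'\n" if 'AB'  =~ /^$as_string$/x;
print "qr under /x matches 'A B'\n"    if 'A B' =~ /^$as_qr$/x;
```

With the qr object, the original regex from the question works unchanged, because the quantifier applies to the group that qr supplies.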