gri6507 has asked for the wisdom of the Perl Monks concerning the following question:
Given a string of type BABABBB and a pattern of AB, I would like to split the string into 3 sections: first=B, middle=ABAB, last=BB. I got the following pertinent code
    use strict;
    my $string;
    read(DATA,$string,7);
    print "$string\n";
    my $pattern = "AB";
    print "pattern is $pattern\n";
    my ($start,$middle,$end) = $string =~ /^(.*?)($pattern+)(.*?)$/g;
    print "splitting\n";
    print "start = $start\n";    #gets B
    print "middle = $middle\n";  #gets AB
    print "end = $end\n";        #gets ABBB
    __DATA__
    BABABBB
What is wrong with this regex? Please help. Thanks.
Re: Regex help
by jeffa (Bishop) on Aug 24, 2003 at 15:31 UTC
By placing the + outside of the first "parened" $pattern, you allow more than one; then put some parens around that to capture the results to $2. Hope this helps, :)
UPDATE: Oops, almost got that right ... now we are trying to match 4 items, not 3 anymore ... so try this:
UPDATE 2: I like liz's and CombatSquirrel's suggestion to use a
UPDATE 3:
jeffa
L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)
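The "4 items, not 3" problem from jeffa's first update can be reproduced with a small sketch. It assumes the first attempt nested one capturing group inside another, which is a guess at the shape of the snippet rather than the snippet itself:

```perl
use strict;
use warnings;

my $string  = "BABABBB";
my $pattern = "AB";

# Both the inner (repeated) group and the outer group capture,
# so the match returns four values instead of three.
my @parts = $string =~ /^(.*?)(($pattern)+)(.*)$/;
print scalar(@parts), " items: @parts\n";   # 4 items: B ABAB AB BB

# One way to live with that: skip the inner capture with undef.
my ($start, $middle, undef, $end) = @parts;
print "start = $start, middle = $middle, end = $end\n";
```

The third value is the last iteration of the inner group, which is why it holds a single AB.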
by CombatSquirrel (Hermit) on Aug 24, 2003 at 15:39 UTC
Suggested alternatives:
gri6507, note that /$pattern+/ is exactly the same as /AB+/ (well, at least for $pattern = 'AB'), which is, by definition, /A(?:B)+/. Hope this helped.
CombatSquirrel.
Update: Arghh - wrong order for capturing and non-capturing parens. Fixed.
Update 2: jeffa is right. The following RegEx should do the trick:
I'm open to any suggestions, and yes, I do know Mastering Regular Expressions, I just forgot half (the important half) of it.
Update 3 (Explanation): The RegEx engine tries to match at the earliest possible position. Therefore it will always match nothing to be captured in $1 (non-greedy dot-star), then the highest possible number of following pattern matches (greedy star), and then the rest. Meaning, if the first pattern does not begin at the first character, $2 will also be empty (after all, a star does not have to match) and the rest is slurped into $3.
Bon appetit!
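The behaviour described in Update 3 is easy to see side by side. This is only a sketch: it assumes the troublesome version quantified the group with a star and the working one with a plus.

```perl
use strict;
use warnings;

my $string  = "BABABBB";
my $pattern = "AB";

# Star: zero repetitions are allowed, so the match succeeds at position 0
# with empty $1 and $2, and the greedy tail takes the whole string.
my @star = $string =~ /^(.*?)((?:$pattern)*)(.*)$/;

# Plus: at least one repetition is required, so the lazy prefix has to
# grow until the pattern can actually start.
my @plus = $string =~ /^(.*?)((?:$pattern)+)(.*)$/;

print "star: [", join("][", @star), "]\n";   # star: [][][BABABBB]
print "plus: [", join("][", @plus), "]\n";   # plus: [B][ABAB][BB]
```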
by BrowserUk (Patriarch) on Aug 24, 2003 at 15:44 UTC
You can avoid the extraneous capture by using non-capturing parens.
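Something along these lines, as a minimal sketch. The helper name split_around is made up for illustration and is not part of the thread:

```perl
use strict;
use warnings;

# Hypothetical helper: split $string around the first run of one or more
# consecutive copies of $pattern. Returns an empty list when nothing matches.
sub split_around {
    my ($string, $pattern) = @_;
    # (?:...) groups without capturing, so only three values come back.
    return $string =~ /^(.*?)((?:$pattern)+)(.*)$/;
}

my ($start, $middle, $end) = split_around("BABABBB", "AB");
print "start = $start\n";    # B
print "middle = $middle\n";  # ABAB
print "end = $end\n";        # BB
```

Because the sub returns the match in list context, a failed match gives an empty list, so the caller can test the result before using it.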
...but you know that:)
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
If I understand your problem, I can solve it! Of course, the same can be said for you.
by gri6507 (Deacon) on Aug 24, 2003 at 15:35 UTC
Re: Regex help
by liz (Monsignor) on Aug 24, 2003 at 15:35 UTC
The problem with your version was that the + in the second container was just +ing the B, so you need to group around the string "AB". But then you get only 1 AB!
Since you want to have all AB's, you need to capture that whole thing again. So there are grouping parentheses around that again. And to not change the order of the captured strings, the inner one has ?: which indicates that it's just a grouping and not a capture.
Hope this helps.
Liz
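The three stages liz walks through can be checked one at a time. A small sketch, using unanchored patterns so that only the middle part is in play:

```perl
use strict;
use warnings;

my $string  = "BABABBB";
my $pattern = "AB";

# 1. The + binds only to the last character: this is really /(AB+)/.
my ($first) = $string =~ /($pattern+)/;
print "AB+ swallows ABBB\n" if "ABBB" =~ /\A$pattern+\z/;

# 2. Grouping repeats the whole pattern, but a repeated capturing group
#    keeps only its last iteration.
my ($second) = $string =~ /($pattern)+/;

# 3. Capture the whole run, and group the repeated part without capturing.
my ($third) = $string =~ /((?:$pattern)+)/;

print "1: $first\n2: $second\n3: $third\n";   # 1: AB  2: AB  3: ABAB
```

Stages 1 and 2 both yield AB here, but for different reasons: the first never repeats the A, the second repeats AB but overwrites its capture on each pass.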
Re: Regex help
by BrowserUk (Patriarch) on Aug 24, 2003 at 15:49 UTC
Another way to do this would be with split.
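A minimal sketch of the split approach, assuming the separator is written as ((?:$pattern)+) so that the whole run of ABs is captured:

```perl
use strict;
use warnings;

my $string  = "BABABBB";
my $pattern = "AB";

# Without capturing parens the separator is thrown away.
my @without = split /(?:$pattern)+/, $string;
print "without capture: @without\n";   # without capture: B BB

# With capturing parens the separator comes back as a field of its own.
my ($start, $middle, $end) = split /((?:$pattern)+)/, $string;
print "start = $start, middle = $middle, end = $end\n";
```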
Much of a muchness in this case, but it does show the little-used technique of using capturing brackets with split to retain the bits that would otherwise be discarded, which is sometimes useful.
by bart (Canon) on Aug 24, 2003 at 16:25 UTC
Try something that contains this pattern twice, and you'll immediately see the difference, as in
Yours would have put just "y" into $end; mine takes the entire rest of the string, "yABz".
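The difference bart describes can be reproduced with a sketch like the one below. Two things here are assumptions: that his version passed a LIMIT of 2 to split, and the test string xABABAByABz, which was chosen only to be consistent with the "y" and "yABz" results he quotes.

```perl
use strict;
use warnings;

my $string  = "xABABAByABz";   # hypothetical test string with two runs of AB
my $pattern = "AB";

# No limit: every run of the pattern splits the string.
my @unlimited = split /((?:$pattern)+)/, $string;
print "unlimited: @unlimited\n";   # unlimited: x ABABAB y AB z

# LIMIT 2: split only once; the captured separator is returned as an
# extra field and does not count toward the limit.
my ($start, $middle, $end) = split /((?:$pattern)+)/, $string, 2;
print "start = $start, middle = $middle, end = $end\n";
```

With the limit, $end holds yABz; without it, the third field is just y.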
Re: Regex help
by davido (Cardinal) on Aug 24, 2003 at 16:48 UTC
Eventually someone will hit on the right technique; one that isn't plagued by lazy regexp engines, greedy matching, etc. But there's another possibility... You could make it easier on yourself, not worrying about trying to match ^(.*?) nongreedily, or about the lazy engine, or about (.*?)$ slurping everything up. Do it like this:
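A minimal sketch of that idea, assuming the pattern is wrapped as (?:$pattern)+ so that the + repeats the whole of AB (the grouping is carried over from the earlier replies, not taken from this post):

```perl
use strict;
use warnings;

my $string  = "BABABBB";
my $pattern = "AB";

my ($start, $middle, $end);
if ($string =~ /(?:$pattern)+/) {
    # $` is the text before the match, $& the match itself,
    # and $' the text after it.
    ($start, $middle, $end) = ($`, $&, $');
}

print "start = $start\n";    # B
print "middle = $middle\n";  # ABAB
print "end = $end\n";        # BB
```

No anchors or dot-stars are needed: the engine finds the leftmost run of ABs and the special variables supply everything around it.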
You take a performance hit in all regexps in the program for using $` and $', but as I understand it, introducing capturing parens also introduces a similar performance hit for the current regular expression. And in non-time-critical operations (anything outside of tight loops) you don't really need to worry about the performance anyway, right? ...so just do it the easy way. If it turns out that you can't live with the speed-efficiency hit taken by leaning toward programming-efficiency, you can dig into other solutions.
But the fact is that $`, $', and $& are there to be used, as long as you understand the ramifications of their use. To my knowledge, their use isn't deprecated, and it would seem that newer releases of Perl have even taken steps to make the use of those special variables more speed-efficiency friendly.
When the solution becomes so tricky that a dozen followup posts are still debating how to accomplish it, I think it's time to implement Perl's credo: There is more than one way to do it. (Start looking for a simpler solution). To that end, give my example a try. Hope this helps...
Dave
"If I had my life to do over again, I'd be a plumber." -- Albert Einstein
by bart (Canon) on Aug 24, 2003 at 16:53 UTC
This is an old problem, and the reason why use of $` and $' is frowned upon for larger scripts. Though I'm almost sure that the perl5porters will find ways to minimize this problem over time.
by davido (Cardinal) on Aug 24, 2003 at 17:09 UTC
The performance hit will be among all regexps in the program, including those that don't use either those special variables or capturing parentheses.
In the Camel book, one item under "Time Efficiency" is not to use $`, $&, and $'. However, one item under "Programmer Efficiency" is to use $`, $&, and $'. To me that says: weigh the time vs. programming simplicity paradox, and choose whichever one you feel is the best for your situation.
The OP's code section was brief. Solving it using non-greedy matches, non-capturing and capturing parens, and a slightly tricky regexp proved to be the topic of a dozen or so post replies in the thread. That tells me that the solutions that followed in the spirit of the OP's methodology were all too complex for the simple problem being solved. That led me to decide, why not take the simpler, less time-efficient, but much more programming-efficient approach.
It would be wrong to say that the use of $`, $&, and $' is deprecated. It clearly is not. It just comes with a caveat: use them, but understand that they will cause a time performance issue with regexps in your program. It is probably safe to say that at some point that will become less of an issue, as Perl continues to grow and develop. And clearly Perl's designers intend to keep those special variables, not just for backward compatibility, but for their continued use. 5.8.0, for example, has found a way to minimize the impact of $&. I wouldn't be surprised to see the impact of $` and $' get improved upon in the future, though I can't claim to know what's going on in the minds of Perl's developers.
Anyway, sorry to get longwinded. I just wanted to explain that it is ok to make a conscious decision to use one method over another, as long as you understand the ramifications of each method.
Dave
"If I had my life to do over again, I'd be a plumber." -- Albert Einstein
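For readers on later perls: releases from 5.10 onward provide the /p modifier together with ${^PREMATCH}, ${^MATCH} and ${^POSTMATCH}, which give the same three pieces for a single match without the program-wide cost discussed above. A minimal sketch, again assuming the (?:$pattern)+ grouping:

```perl
use strict;
use warnings;
use 5.010;   # /p and the ${^...} match variables need Perl 5.10 or later

my $string  = "BABABBB";
my $pattern = "AB";

my ($start, $middle, $end);
if ($string =~ /(?:$pattern)+/p) {
    # Per-match equivalents of $`, $& and $'.
    ($start, $middle, $end) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
}

print "start = $start, middle = $middle, end = $end\n";
```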
Re: Regex help
by bart (Canon) on Aug 24, 2003 at 16:46 UTC
or
It also incorporates regex switches into the regex, so you can't globally override them. For example, if $pattern looks like "A B", if you use it in /$pattern/x, the space would be stripped from the subpattern. But not with qr.
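Both points about qr can be sketched briefly. The variable names $plain and $compiled are made up for the illustration:

```perl
use strict;
use warnings;

my $string = "BABABBB";

# A qr// object interpolates as a self-contained group, so a trailing +
# repeats the whole pattern rather than just its last character.
my $pattern = qr/AB/;
my ($start, $middle, $end) = $string =~ /^(.*?)($pattern+)(.*)$/;
print "start = $start, middle = $middle, end = $end\n";

# Switches are baked in: the space in qr/A B/ survives an outer /x,
# while the same text interpolated from a plain string is stripped by it.
my $plain    = "A B";
my $compiled = qr/A B/;

print "plain string under /x matches 'AB'\n"  if     "AB"  =~ /\A$plain\z/x;
print "qr under /x matches 'A B'\n"           if     "A B" =~ /\A$compiled\z/x;
print "qr under /x does not match 'AB'\n"     unless "AB"  =~ /\A$compiled\z/x;
```

All three print statements fire, which is the behaviour described above: the outer /x cannot reach inside the compiled pattern.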