Re: strip out anything inbetween brackets
by cog (Parson) on Apr 05, 2005 at 14:07 UTC
|
The naive approach would be
$string =~ s/\(.*\)//;
Which would do the trick in this particular case, but would convert "this is a (blah) and this is not a (blah)" into "this is a ", which is why you should use a non-greedy quantifier:
$string =~ s/\(.*?\)//;
This does the trick...
Don't forget, however, to use the /g switch (for global substitutions). Also, your example gives the result as "this is a" (notice there's no space after the a...)
If that's what you want, you just need to include \s* on both ends of your regular expression...
OTOH, that would turn "this is a (blah) bleh" into "this is ableh", which is probably not what you want... O:-)
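To make the differences above concrete, here is a small sketch that runs each variant on the sample strings from this post. The variable names are mine, and the last variant (eating only the whitespace before each group) is one possible compromise, not something proposed above.

```perl
use strict;
use warnings;

my $input = "this is a (blah) and this is not a (blah)";

# Greedy: .* runs from the first "(" all the way to the last ")".
(my $greedy = $input) =~ s/\(.*\)//;

# Non-greedy without /g: only the first group is removed.
(my $first_only = $input) =~ s/\(.*?\)//;

# Non-greedy with /g: every group is removed.
(my $all = $input) =~ s/\(.*?\)//g;

# Assumption: removing only the whitespace *before* each group
# avoids gluing "a" and "bleh" together.
(my $tidy = "this is a (blah) bleh") =~ s/\s*\(.*?\)//g;

print "[$greedy]\n[$first_only]\n[$all]\n[$tidy]\n";
```

The square brackets in the output are only there to make trailing spaces visible.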
|
Well, I was going to post to say you should really be using a negated character class, rather than having all that backtracking going on. I was pretty sure it'd be faster, and it's what I would normally do when coding regexes like this.
I did a quick benchmark first, and it turns out I was wrong: the negated character class gets relatively less efficient the longer the data it has to scoop up. In the run below it managed only about a third of the non-greedy version's rate.
use strict;
use Benchmark qw(:all);

my $count = 50000;
my $replacement_string = "this is a (" . "a" x 1000 . ") test";

cmpthese($count, {
    'negated' => sub {
        my $text = $replacement_string;
        $text =~ s|\([^)]*\)||sg;
    },
    'backtrack' => sub {
        my $text = $replacement_string;
        $text =~ s|\(.*?\)||sg;
    },
});
OUTPUT
             Rate   negated backtrack
negated    8562/s        --      -67%
backtrack 26316/s      207%        --
I still think there's something to be said for the character class, as it is more explicit (after all, we are trying to match anything other than the closing bracket), but it is certainly slower.
This surprised me, so I thought I'd post it, in case it surprised anyone else.
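Speed aside, the two patterns only behave the same here because both use /s. A minimal sketch of the difference, using a made-up string with a newline inside the parentheses:

```perl
use strict;
use warnings;

# Hypothetical input: the parenthesised part spans two lines.
my $multiline = "keep (drop\nthis) keep";

# Without /s, "." does not match the newline, so nothing is removed.
(my $dot_plain = $multiline) =~ s/\(.*?\)//g;

# With /s, "." matches the newline as well.
(my $dot_s = $multiline) =~ s/\(.*?\)//sg;

# [^)] matches a newline whether or not /s is given.
(my $negated = $multiline) =~ s/\([^)]*\)//g;

print $dot_plain eq $multiline ? "unchanged\n" : "changed\n";
print "[$dot_s]\n[$negated]\n";
```

So if the /s is ever dropped, the negated class keeps working on multi-line input and the non-greedy dot does not.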
|
The second one will work provided that you don't have nested parens:
This is a ((very important) blah)
If there's a possibility of that sort of thing happening, you'll probably want to look at Pustular Postulant's recommendation and not use a regex. (I don't know that exact module, so I can't say whether it will handle this or whether you'll need to look for something else.) I've typically run into this problem with SGML, so I used a parser written specifically for HTML or XML. I don't know if there's something that does nested braces and the like.
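For nested parens, one regex-free option is to walk the string and keep a depth counter. This is only a sketch of that idea, not the module mentioned above; the function name strip_nested is made up for illustration.

```perl
use strict;
use warnings;

# Keep only the characters that sit outside every pair of parentheses.
sub strip_nested {
    my ($text) = @_;
    my ($depth, $out) = (0, '');
    for my $ch (split //, $text) {
        if    ($ch eq '(')  { $depth++ }
        elsif ($ch eq ')')  { $depth-- if $depth > 0 }  # a stray ")" is dropped
        elsif ($depth == 0) { $out .= $ch }
    }
    return $out;
}

print "[", strip_nested("This is a ((very important) blah)"), "]\n";
```

Note the design choice: an unmatched ")" is silently discarded, and an unmatched "(" swallows everything after it. Whether that is acceptable depends on the data.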
|
Perfect, thank you for your help.
Re: strip out anything inbetween brackets
by tlm (Prior) on Apr 05, 2005 at 14:16 UTC
|
In general, it is difficult to deal with balanced delimiters using regexps alone (the problem arises when you have nested parenthesized expressions). You may want to take a look at Text::Balanced.
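A rough sketch of how Text::Balanced might be used for this, assuming its extract_bracketed function called in list context with '()' as the delimiter and a prefix pattern that skips everything up to the next parenthesis. The wrapper strip_balanced_parens is a hypothetical helper, not part of the module.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Hypothetical helper: repeatedly pull out the next balanced (...) group
# and keep only the text that preceded it.
sub strip_balanced_parens {
    my ($text) = @_;
    my $out = '';
    while (length $text) {
        # In list context: (extracted group, remainder, skipped prefix).
        my ($extracted, $remainder, $prefix) =
            extract_bracketed($text, '()', '[^()]*');
        if (defined $extracted && length $extracted) {
            $out .= $prefix;
            $text = $remainder;
        }
        else {
            # No balanced group ahead: keep the rest as it is.
            $out .= $text;
            last;
        }
    }
    return $out;
}

print "[", strip_balanced_parens("woaaa (nested (parens)) end"), "]\n";
```

This sketch stops at the first unbalanced parenthesis and leaves the remainder untouched, which may or may not be the behaviour you want.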
|
For this simple case, maybe a simple "do while you can" can do the job:
#!/usr/local/bin/perl
local $_ = "woaaa (nested (parens))";
while (s/\([^()]*\)//) {
    print "$_\n";
}
__OUTPUT__
woaaa (nested )
woaaa 
But it's possible that I'm missing something...
|
Try your code against $_="woaaa (nested (parens)))".
Update: Well, that's not fair of me, because the parens are not balanced; but it does show that the general problem has wrinkles. Judging by follow-up posts, it looks like the OP's problem is simple enough to be dealt with by a plain regexp. Text::Balanced shines when you have to process large chunks of text (e.g. source code) with balanced expressions.
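To see the wrinkle, here is the "do while you can" loop from the parent post wrapped in a function and run on the unbalanced string. The function name and the leftover count are my additions for illustration.

```perl
use strict;
use warnings;

# Remove innermost (...) groups until none are left.
sub strip_innermost_first {
    my ($text) = @_;
    1 while $text =~ s/\([^()]*\)//;
    return $text;
}

my $result   = strip_innermost_first("woaaa (nested (parens)))");
# tr with an empty replacement list just counts the listed characters.
my $leftover = ($result =~ tr/()//);

print "[$result]\n";
print "unmatched parentheses left: $leftover\n";
```

A non-zero count is a cheap signal that the input was not balanced and the result needs a second look.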
Re: strip out anything inbetween brackets
by ww (Archbishop) on Apr 05, 2005 at 15:19 UTC
|
In addition to the suggestions above, you may also wish to note that the "(" and ")" in your question are "parentheses" rather than "brackets" (which come in several flavors, including square, [ and ], curly, { and }, and angle, < and >). It's UNimportant in this case, but would be very significant in most code.
Likewise, what you call "speech marks" are (if I understand you correctly) "quotation marks" or, loosely and idiomatically, "double quotes" (to distinguish them from single quotes, " ' ").
And I second the above: get an account. You'll find the monastery a welcoming and helpful place.
|
you may also wish to note that the "(" and ")" in your question are "parentheses" rather than "brackets"
Don't be so dogmatic! I don't know where the original poster is from, but in everyday speech (and indeed punctuation manuals) in the UK "(" and ")" are indeed called "brackets"; the same may well be true in other places. The word "parentheses" is known in the UK, but it's rarely heard and somebody using it risks sounding pretentious.
Likewise, what you call "speech marks" are (if I understand you correctly) "quotation marks" or, loosely and idiomatically, "double quotes"
"Speech marks" is also a commonly understood term in the UK.
TMTOWTDI! Other people may come from cultures which use different terms for some things. That's OK — as human beings we can cope with occasionally having to take a second longer to read an unfamiliar phrase. It certainly doesn't mean that 'your' terms are 'right' and the other person's are 'wrong'.
Smylers
|
smylers:
You're right, of course. I, especially, should not fall into a regionalistic trap like that. But, on the other hand, I will cheerfully risk being viewed as "pretentious" if that's the result of an effort to communicate clearly, in the language of the listener or reader.
But, for my info, how does UK English distinguish among parens, square brackets, angle brackets and curly brackets? (Others offering distinctive regionalisms or national usages are encouraged too!)
And, for what LITTLE it's worth, I do not recall hearing (as a child in Edinburgh) any teacher referring to "" as speech marks.
|
I'll take some issue with some of your linguistic pedantry. :)
Parentheses are a type of bracket, surely. What do they do but bracket things? You point out that brackets come in the square, curly, and angle flavours. Why leave out the round flavour?
What's wrong with using the term 'speech marks'? Speech is double quoted in almost all literature I've come across, and I certainly grew up referring to double quotes as such. To insist on calling them quotation marks seems odd, especially when you go on to talk about double and single 'quotes' ('quotes' almost certainly being an abbreviation of 'quotation marks'). I've always called single quotes single quotes, though.
While I appreciate your intent to educate, I do think you're incorrect or misleading on some of these points. But do correct me if you think I'm wrong.
I'd like to point out that I was doing some work while writing this reply, and smylers nipped in in front of me ;)
|
Oops. Now ya got me ranting (and nitpicking): despite the smile, "pedantry" (of which I can be guilty :>}) twists my arms and makes me disagree (on grounds of insufficient precision and breadth) with "Speech is double quoted in almost all literature I've come across,...."
Specifically, printed versions of speech are customarily double-quoted except when that speech is both single- and double-quoted (a quote of another utterance) and, in fact, depending on the narrowness of your definition of speech (ie, if narrower than a (US?) legalism in which speech includes writing, and sometimes even throwing paint at a wall), then double-quotes are also used to indicate an utterance in words, regardless of the technique.
[/rant] :<}
|
The toughest part of any technology project is communicating. Communicating the original specifications can be excruciating sometimes, and communication among team members takes up much time to ensure that every idea has been effectively passed between them.
English is a wonderful language; it borrows from other languages whenever it feels like it. There is almost always more than one way (and often many ways) to say the same thing. It reminds me of a certain scripting language...