PerlMonks  

strip out anything inbetween brackets

by Anonymous Monk
on Apr 05, 2005 at 14:01 UTC ( [id://444983]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a string containing "this is a (blah)" (without the speech marks). How do I alter the string to remove the brackets and anything in between, so that the string becomes "this is a"?

Replies are listed 'Best First'.
Re: strip out anything inbetween brackets
by cog (Parson) on Apr 05, 2005 at 14:07 UTC
    The naive approach would be

    $string =~ s/\(.*\)//;

    Which would do the trick in this particular case, but would turn "this is a (blah) and this is not a (blah)" into "this is a ", which is why you should use a non-greedy quantifier:

    $string =~ s/\(.*?\)//;

    This does the trick...

    Don't forget, however, to use the /g switch (for global substitutions). Also, your example has the result as being "this is a" (notice there's no space after the a...)

    If that's what you want, you just need to include \s* on both ends of your regular expression...

    OTOH, that would turn "this is a (blah) bleh" into "this is ableh", which is probably not what you want... O:-)
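
A minimal sketch of the variants discussed in this reply, side by side. The sub names (strip_greedy, strip_lazy, strip_and_tidy) are made up for illustration, and the last one assumes you only want to swallow whitespace in front of the group, which is one way around the "ableh" problem:

```perl
use strict;
use warnings;

# Greedy: .* runs on to the LAST closing parenthesis in the string.
sub strip_greedy {
    my ($s) = @_;
    $s =~ s/\(.*\)//g;
    return $s;
}

# Non-greedy with /g: every (...) group is removed separately.
sub strip_lazy {
    my ($s) = @_;
    $s =~ s/\(.*?\)//g;
    return $s;
}

# Hypothetical variant: also absorb whitespace BEFORE the group, but keep
# whitespace after it so the following word stays separated.
sub strip_and_tidy {
    my ($s) = @_;
    $s =~ s/\s*\(.*?\)//g;
    return $s;
}

my $in = "this is a (blah) and this is not a (blah)";
print "[", strip_greedy($in), "]\n";                       # [this is a ]
print "[", strip_lazy($in), "]\n";                         # [this is a  and this is not a ]
print "[", strip_and_tidy("this is a (blah)"), "]\n";      # [this is a]
print "[", strip_and_tidy("this is a (blah) bleh"), "]\n"; # [this is a bleh]
```

Note that none of these handle nested parentheses, and without /s the dot will not cross a newline.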

      Well I was going to post to say you should really be checking using a negated character class, rather than having all that backtracking going on. I was pretty sure it'd be faster, and it's what I would normally do when coding regexes like this.

      I did a quick benchmark first, and it turns out I was wrong: the negated character class gets relatively less efficient the longer the data it has to scoop up. In the run shown here the non-greedy version comes out roughly 200% faster.

      use strict;
      use Benchmark qw(:all);

      my $count = 50000;
      my $replacement_string = "this is a (" . "a" x 1000 . ") test";

      cmpthese($count, {
          'negated' => sub {
              my $text = $replacement_string;
              $text =~ s|\([^)]*\)||sg;
          },
          'backtrack' => sub {
              my $text = $replacement_string;
              $text =~ s|\(.*?\)||sg;
          },
      });

      OUTPUT
                   Rate   negated backtrack
      negated    8562/s        --      -67%
      backtrack 26316/s      207%        --

      I still think there's something to be said for the character class, as it is more explicit (after all, we are trying to match anything other than the closing bracket), but it is certainly slower.

      This surprised me, so I thought I'd post it, in case it surprised anyone else.

      The second one will work provided that you don't have nested parens:

      This is a ((very important) blah)

      If there's a possibility of that sort of thing happening, you'll probably want to look at Pustular Postulant's recommendation, and not use a regex. (I don't know that exact module, so I can't say whether it'll handle it or whether you'd need to look for something else.) I've typically run into this problem with SGML, so I used a parser specifically for HTML or XML... I don't know if there's something that does nested braces and the like.
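
For what it's worth, a sketch of the nested case done with a regex after all. It relies on recursive patterns, which only arrived in Perl 5.10 (later than this thread), so treat it as an option for newer perls rather than as what was being recommended here. The sub name strip_nested is made up for illustration:

```perl
use strict;
use warnings;
use 5.010;    # recursive patterns such as (?R) need Perl 5.10 or later

sub strip_nested {
    my ($s) = @_;
    # \(            an opening parenthesis
    # (?: ... )*    then any number of:
    #   [^()]++       runs of non-parenthesis characters (possessive, no backtracking)
    #   (?R)          or a nested group, by recursing into the whole pattern
    # \)            and the matching closing parenthesis
    $s =~ s/\((?:[^()]++|(?R))*\)//g;
    return $s;
}

print "[", strip_nested("This is a ((very important) blah)"), "]\n";  # [This is a ]
```

Unbalanced input is left partly untouched, so this is only a reasonable choice when you trust the parentheses to pair up.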

      Perfect, thank you for your help.
        You are welcome, Anonymous Monk.

        You should also consider creating a user on this site :-)

Re: strip out anything inbetween brackets
by tlm (Prior) on Apr 05, 2005 at 14:16 UTC

    In general, it is difficult to deal with balanced delimiters using regexps alone (the problem arises when you have nested parenthesized expressions). You may want to take a look at Text::Balanced.
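
A rough sketch of how Text::Balanced might be applied here, using its extract_bracketed function in list context (extracted text, remainder, skipped prefix). The helper name strip_balanced and the '[^(]*' prefix pattern are assumptions made for this example, not something taken from the module's documentation:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Hypothetical helper: drop every balanced (...) group, nested or not.
sub strip_balanced {
    my ($text) = @_;
    my $out = '';
    while (length $text) {
        # Treat everything up to the next "(" as a prefix to skip, then try
        # to extract one balanced group starting at that parenthesis.
        my ($group, $rest, $prefix) = extract_bracketed($text, '()', '[^(]*');
        if (defined $group && length $group) {
            $out .= $prefix;    # keep the text in front of the group
            $text = $rest;      # carry on after the group
        }
        else {
            $out .= $text;      # no balanced group found: keep the rest as-is
            last;
        }
    }
    return $out;
}

print "[", strip_balanced("This is a ((very important) blah)"), "]\n";
```

When no balanced group can be extracted, this version simply keeps whatever text remains, which may or may not be what you want for malformed input.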

    the lowliest monk

      For this simple case, maybe a simple "do while you can" can do the job:

      #!/usr/local/bin/perl
      local $_ = "woaaa (nested (parens))";
      while (s/\([^()]*\)//) {
          print "$_\n";
      }

      __OUTPUT__
      woaaa (nested )
      woaaa

      But it's possible that I'm missing something...

        Try your code against $_="woaaa (nested (parens)))".

        Update: Well, that's not fair of me, because the parens are not balanced; but it does show that the general problem has wrinkles. Judging by follow-up posts it looks like the OP's problem is simple enough to be dealt with by a plain regexp. Text::Balanced shines when you have to process large chunks of text (e.g. source code) with balanced expressions.
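
One way to make that wrinkle visible, sketched under the assumption that you just want to know whether anything was left over: run the innermost-first loop, then count the stray parentheses that survive. The sub name strip_checked is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the stripped string plus the number of
# parentheses that could not be paired up.
sub strip_checked {
    my ($s) = @_;
    # Remove innermost (...) groups repeatedly until none are left.
    1 while $s =~ s/\([^()]*\)//;
    # tr with an empty replacement list only counts; it does not modify $s.
    my $stray = ($s =~ tr/()//);
    return ($s, $stray);
}

my ($ok,  $n1) = strip_checked("woaaa (nested (parens))");
my ($bad, $n2) = strip_checked("woaaa (nested (parens)))");
print "[$ok] stray=$n1\n";    # [woaaa ] stray=0
print "[$bad] stray=$n2\n";   # [woaaa )] stray=1
```

A non-zero count tells you the input was unbalanced; what to do about it is up to the caller.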

        the lowliest monk

Re: strip out anything inbetween brackets
by ww (Archbishop) on Apr 05, 2005 at 15:19 UTC

    In addition to the suggestions above, you may also wish to note that the "(" and ")" in your question are "parentheses" rather than "brackets" (which come in several flavors, including square, [ and ], curly, { and }, and angle, < and >). It's UNimportant in this case, but would be very significant in most code.

    Likewise, what you call "speech marks" are (if I understand you correctly) "quotation marks" or, loosely and idiomatically, "double quotes" (to distinguish from single quotes, " ' ").

    And I second the above: get an account. You'll find the monastery a welcoming and helpful place.

      you may also wish to note that the "(" and ")" in your question are "parentheses" rather than "brackets"

      Don't be so dogmatic! I don't know where the original poster is from, but in everyday speech (and indeed punctuation manuals) in the UK "(" and ")" are indeed called "brackets"; the same may well be true in other places. The word "parentheses" is known in the UK, but it's rarely heard and somebody using it risks sounding pretentious.

      Likewise, what you call "speech marks" are (if I understand you correctly) "quotation marks" or, loosely and idiomatically, "double quotes"

      "Speech marks" is also a commonly understood term in the UK.

      TMTOWTDI! Other people may come from cultures which use different terms for some things. That's OK — as human beings we can cope with occasionally having to take a second longer to read an unfamiliar phrase. It certainly doesn't mean that 'your' terms are 'right' and the other person's are 'wrong'.

      Smylers

        smylers:

        You're right, of course. I, especially, should not fall into a regionalistic trap like that. But, on the other hand, I will cheerfully risk being viewed as "pretentious" if that's the result of an effort to communicate clearly, in the language of the listener or reader.

        But, for my info, how does UK English distinguish among parens, square brackets, angle brackets and curly brackets? (Others offering distinctive regionalisms or national usages are encouraged too!)

        And, for what LITTLE it's worth, I do not recall hearing (as a child in Edinburgh) any teacher referring to "" as speech marks.

      I'll take some issue with some of your linguistic pedantry. :)

      Parentheses are a type of bracket, surely. What do they do but bracket things? You point out that brackets come in the square, curly, and angle flavours. Why leave out the round flavour?

      What's wrong with using the term 'speech marks'? Speech is double quoted in almost all literature I've come across, and I certainly grew up referring to double quotes as such. To insist on calling them quotation marks seems odd, especially when you go on to talk about double and single 'quotes' ('quotes' almost certainly being an abbreviation of 'quotation marks'). I've always called single quotes single quotes, though.

      While I appreciate your intent to educate, I do think you're incorrect or misleading on some of these points. But do correct me if you think I'm wrong.

      I'd like to point out that I was doing some work while writing this reply, and smylers nipped in in front of me ;)

        Oops. Now ya got me ranting (and nitpicking): despite the smile, "pedantry" (of which I can be guilty :>}) twists my arm and makes me disagree (on grounds of insufficient precision and breadth) with "Speech is double quoted in almost all literature I've come across,...."

        Specifically, printed versions of speech are customarily double-quoted except when that speech is both single- and double-quoted (a quote of another utterance) and, in fact, depending on the narrowness of your definition of speech (ie, if narrower than a (US?) legalism in which speech includes writing, and sometimes even throwing paint at a wall), then double-quotes are also used to indicate an utterance in words, regardless of the technique.

        [/rant]
                :<}
      While understanding the viewpoints of Jasper and Smylers, I have to say that ww makes a good post.

      I myself (an American English speaker) was somewhat confused by the usage of brackets as opposed to parentheses. I only smiled slightly at the use of "speech marks" - which simply looked odd to me.

      It is most understandable that certain cultures/languages say things differently (as noted by our two friends across the pond). However, I have learned that these things are, in fact, commonly used elsewhere simply because of the post. And while it is prideful to think this site belongs to any particular culture, I think ww was attempting a very polite way of gently suggesting an alternative, which was perhaps more in tune with "programmer speak" than "American English".

      In short... ++'s all around! :P
      --------------
      "But what of all those sweet words you spoke in private?"
      "Oh that's just what we call pillow talk, baby, that's all."
        The toughest part of any technology project is communicating. Communicating the original specifications can be excruciating sometimes, and communication among team members takes up much time to ensure that every idea has been effectively passed between them.

        English is a wonderful language; it borrows from other languages whenever it feels like it. There is almost always more than one way (and often many ways) to say the same thing. It reminds me of a certain scripting language...

Node Type: perlquestion [id://444983]
Approved by Corion