rajaman has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I am trying to split a string at its longest balanced-bracket group, i.e., in a greedy way.

For example, I want to split the string shown below into two parts: i) 'The use of parentheses.', and ii) '(indicates that the (writer [considered] the {information}) less <important-almost> an afterthought)'.

While part ii) could be obtained using the code below, I am not sure how to get the remaining part of the string, i.e., part i). I also tried to work with 'Text::Balanced' module but couldn't make it work.

Thank you for suggestions.

use Regexp::Common 'RE_ALL';

my $string = 'The use of parentheses (indicates that the (writer [considered] the {information}) less <important—almost> an afterthought).';

$string =~ RE_balanced(-parens=>'(){}[]<>') and print "$1\n";

Replies are listed 'Best First'.
Re: bracket processing
by kcott (Archbishop) on Mar 31, 2020 at 04:09 UTC

    G'day rajaman,

    "I also tried to work with 'Text::Balanced' module but couldn't make it work."

    If you were more specific about what "make it work" means, and also posted the code you tried, we could probably be more helpful. Here's a couple of examples of what you could have done.

    #!/usr/bin/env perl

    use strict;
    use warnings;

    use Text::Balanced 'extract_bracketed';

    my $delim = '([{<';
    my $prefix = qr{[^$delim]*};

    my $string = 'The use of parentheses (indicates that the (writer [considered] the {information}) less <important—almost> an afterthought).';

    my @parts = extract_bracketed($string, $delim, $prefix);

    print " i) $parts[2]$parts[1]\n";
    print "ii) $parts[0]\n";

    my ($trimmed_start) = $parts[2] =~ /^(.*?)\s*$/;

    print " I) $trimmed_start$parts[1]\n";
    print "II) $parts[0]\n";

    Output:

     i) The use of parentheses .
    ii) (indicates that the (writer [considered] the {information}) less <important—almost> an afterthought)
     I) The use of parentheses.
    II) (indicates that the (writer [considered] the {information}) less <important—almost> an afterthought)

    The first two results (i & ii) are what I believe you asked for but not what your expected output showed: note the extra space at the end (parentheses .).

    The second two results (I & II) show how you might trim that extra space. This looks more like the expected output you show but doesn't exactly follow your description.

    It would be helpful if you presented your expected output within <code>...</code> tags, so that we can see more clearly exactly what you want (HTML paragraph rendering is not always faithful to the original text).

    It would also help if you supplied a range of much shorter input samples, along with your expected output for these. Consider edge cases: no text before the first bracket; no text after the last bracket; unbalanced brackets in various places; and so on.

    See also: Text::Balanced.

    — Ken

      my $delim = '([{<';
      my $prefix = qr{[^$delim]*};

      As a general practice, I find it's much safer to interpolate strings like  $delim into regexes using  \Q \E metaquote escapes:
          my $delim = '([{<';
          my $prefix = qr{[^\Q$delim\E]*};
      Of course, one could metaquote the string variable upon definition:
          my $delim = quotemeta '([{<';
      but that might screw up subsequent use of the string; e.g., its use in something like
          my @parts = extract_bracketed($string, $delim, $prefix);
      might become problematic.


      Give a man a fish:  <%-{-{-{-<

        "As a general practice, I find it's much safer to interpolate strings ... into regexes using \Q \E ..."

        As a general rule, for regexes in general, that's fine and I'd generally do the same; however, bracketed classes are different.

        Take a look at "perlrecharclass: Special Characters Inside a Bracketed Character Class". I'll leave you to acquaint yourself with the full text. Here's some pertinent extracts (my emphasis added):

        Most characters that are meta characters in regular expressions ... lose their special meaning and can be used inside a character class without the need to escape them.
        ...
        Characters that may carry a special meaning inside a character class are: \ , ^ , - , [ and ] , and are discussed below.
        ...
        A [ is not special inside a character class, unless it's the start of a POSIX character class ... It normally does not need escaping.

        So, none of the characters in $delim required escaping.

        Furthermore, I generally aim to thoroughly test my solutions before posting them. In this instance, I had added a temporary print statement:

        my $prefix = qr{[^$delim]*};
        print "$prefix\n";

        which output:

        (?^:[^([{<]*)

        That's exactly the regex I wanted.
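Both points above can be checked side by side. The following is a minimal sketch, not code from either poster: for the `$delim` used in this thread the plain and the `\Q \E` forms behave identically, while a hypothetical delimiter string containing `]` (the `$risky` variable below is purely illustrative) is one case where the quoted form changes the meaning of the class.

```perl
use strict;
use warnings;

# The delimiter set used in this thread: no character needs escaping in a class.
my $delim  = '([{<';
my $plain  = qr{[^$delim]*};
my $quoted = qr{[^\Q$delim\E]*};

my $text = 'The use of parentheses (indicates ...)';
my ($p) = $text =~ /^($plain)/;
my ($q) = $text =~ /^($quoted)/;
print "same prefix: '$p'\n" if $p eq $q;

# Hypothetical delimiter string (an assumption, not from the thread) with ']'
# in a position where it would close the bracketed class early.
my $risky    = ')]';
my $unquoted = qr{[^$risky]*};       # compiles as [^)] followed by ]*
my $safe     = qr{[^\Q$risky\E]*};   # compiles as [^\)\]]*

my ($got_unquoted) = 'ab]cd' =~ /^($unquoted)/;
my ($got_safe)     = 'ab]cd' =~ /^($safe)/;
print "unquoted: '$got_unquoted', quoted: '$got_safe'\n";
```

So whether `\Q \E` is needed inside a bracketed class depends on whether the interpolated string can contain `\`, `^`, `-`, `[` or `]`, the characters listed in the perlrecharclass extract.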

        — Ken

Re: bracket processing
by 1nickt (Canon) on Mar 31, 2020 at 02:04 UTC

    Hi,

    Just remove the matching string (and whatever's left over) when you find that the RE matches. The matched string will be in $1 as normal and the original string will be trimmed to just the prefix.

    use strict; use warnings; use 5.010;
    use Regexp::Common 'RE_balanced';

    my $string = 'The use of parentheses (indicates that the '
               . '(writer [considered] the {information}) less '
               . '<important—almost> an afterthought).';

    my $match = RE_balanced( -parens => '(){}[]<>' );

    $string =~ s/${match}.*//;

    $1 and say ">$_<" for $string, $1;
    Output:
    $ perl 11114819.pl
    >The use of parentheses <
    >(indicates that the (writer [considered] the {information}) less <important—almost> an afterthought)<

    Hope this helps!


    The way forward always starts with a minimal test.

      Update: Per haukex's query, the version of Regexp::Common::balanced I'm using is 2010010201, so yes, I'm a bit behind the times and the behavior of 1nickt's code is not unexpected.

      c:\@Work\Perl\monks>perl -wMstrict -le "use Regexp::Common qw(balanced); print $Regexp::Common::balanced::VERSION; "
      2010010201


      Are you sure about that code? I can only generate the given output when I add the  -keep switch to the  RE_balanced() call.

      Without -keep:

      c:\@Work\Perl\monks>perl -wMstrict -le "use strict; use warnings; use 5.010; use Regexp::Common 'RE_balanced'; my $string = 'The use of parentheses (indicates that the ' . '(writer [considered] the {information}) less ' . '<importantalmost> an afterthought).'; my $match = RE_balanced( -parens => '(){}[]<>' ); $string =~ s/${match}.*//; $1 and say \"^>$_^<\" for $string, $1; "
      With -keep:
      c:\@Work\Perl\monks>perl -wMstrict -le "use strict; use warnings; use 5.010; use Regexp::Common 'RE_balanced'; my $string = 'The use of parentheses (indicates that the ' . '(writer [considered] the {information}) less ' . '<importantalmost> an afterthought).'; my $match = RE_balanced( -parens => '(){}[]<>', -keep ); $string =~ s/${match}.*//; $1 and say \"^>$_^<\" for $string, $1; "
      >The use of parentheses <
      >(indicates that the (writer [considered] the {information}) less <importantalmost> an afterthought)<


      Give a man a fish:  <%-{-{-{-<

        I can only generate the given output when I add the -keep switch to the RE_balanced() call.

        What version are you using? From Regexp::Common::balanced:

        Since version 2013030901, $1 will always be set (to the entire matched substring), regardless whether {-keep} is used or not.
Re: bracket processing
by LanX (Saint) on Mar 31, 2020 at 01:26 UTC
    I'm not sure about your "longest greedy" requirement, or whether it really has to be one single regex.

    I'd go for a KISS approach: repeatedly replace non-nested pairs with placeholders.

    0: The use of parentheses (indicates that the (writer [considered] the {information}) less <important—almost> an afterthought).

    1: The use of parentheses (indicates that the (writer %0% the %1%) less %2% an afterthought).

    2: The use of parentheses (indicates that the %3% less %2% an afterthought).

    3: The use of parentheses %4%.

    This can be done by repeating one simple regex over and over and storing the matches in an array. Afterwards you just need to reconstruct the tree again.

    NB: I just used %n% for visualization. Using something like \0 is far better here.
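The passes listed above might be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names `stash_placeholder` and `expand` are hypothetical, the input is assumed to contain no NUL characters, and the sample string uses a plain hyphen in `<important-almost>`.

```perl
use strict;
use warnings;

my $original = 'The use of parentheses (indicates that the (writer [considered] '
             . 'the {information}) less <important-almost> an afterthought).';

my @stash;    # $stash[$n] holds the text that placeholder NUL-n-NUL stands for

# One pair of any bracket type whose body contains no bracket characters.
my $no_bracket = qr/[^()\[\]{}<>]*/;
my $innermost  = qr/\($no_bracket\)|\[$no_bracket\]|\{$no_bracket\}|<$no_bracket>/;

# Hypothetical helper: remember a group and return its placeholder.
sub stash_placeholder {
    my ($group) = @_;
    push @stash, $group;
    return "\0" . $#stash . "\0";
}

# Each pass replaces every currently non-nested pair; stop when a pass finds none.
my $text = $original;
1 while $text =~ s/($innermost)/stash_placeholder($1)/ge;

# Hypothetical helper: rebuild the original text of a string with placeholders.
sub expand {
    my ($s) = @_;
    1 while $s =~ s/\0(\d+)\0/$stash[$1]/g;
    return $s;
}

# Placeholders left in $text are the outermost groups.
my @top_level = map { expand($stash[$_]) } $text =~ /\0(\d+)\0/g;
(my $stripped = $text) =~ s/\s*\0\d+\0//g;

print "$stripped\n";
print "$_\n" for @top_level;
```

Unbalanced brackets never match the pattern, so they are simply left in place.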

    HTH :)

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

Re: bracket processing
by AnomalousMonk (Archbishop) on Mar 31, 2020 at 06:21 UTC
Re: bracket processing
by rajaman (Sexton) on Mar 31, 2020 at 20:56 UTC
    Thank you Monks, great points all!

    As suggested by @Ken and @AnomalousMonk, let me elaborate on the problem. The overall goal is to remove all types of brackets from 'noisy' text (e.g., HTML content, tweets, etc.), thereby 'sanitizing' the text. Brackets may appear in the text in any number and in any form (edge cases); the idea is to remove the content within all non-overlapping, longest-extending, balanced brackets, regardless of their types. Strings that have unbalanced brackets can be ignored. Characters flanking the brackets may be any of (\s or \. or \; or \: or \,).

    The script below, based on Ken's and AnomalousMonk's suggestions, works well if there is just one 'big' bracket:

    Program:
    ####program.pl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Text::Balanced 'extract_bracketed';

    my $delim = '([{<';
    my $prefix = qr{[^\Q$delim\E]*};

    my $string = 'The use of parentheses (indicates that the (writer [considered] the {information}) less <important—almost> an afterthought).';

    my @parts = extract_bracketed($string, $delim, $prefix);
    $parts[2]=~s/\s*$//;

    # Print to STDOUT (the original printed to a filehandle WF1 that was never opened)
    print "pattern:\'$parts[0]\'\n";
    print "rightside of pattern:\'$parts[1]\'\n";
    print "leftside of pattern:\'$parts[2]\'\n";
    Output:
    pattern:'(indicates that the (writer [considered] the {information}) less <important—almost> an afterthought)'
    rightside of pattern:'.'
    leftside of pattern:'The use of parentheses'
    However, when another non-overlapping bracket appears in the string (e.g. '(use {of})'), the above script removes just one:
    $string = 'The (use {of}) parentheses (indicates that the (writer [considered] the {information}) less <important—almost> an afterthought).';

    Output:
    pattern:'(use {of})'
    rightside of pattern:' parentheses (indicates that the (writer [considered] the {information}) less <important—almost> an afterthought).'
    leftside of pattern:'The'
    The desired output, though, should have the second bracket removed as well, something along these lines:

    Desired output:'The parentheses.';
    Pattern1removed:'(use {of})'
    Pattern2removed:'(indicates that the (writer [considered] the {information}) less <important—almost> an afterthought)'

    How can such cases be addressed?

    Thanks again for your help!

      You could just use a loop, dealing with each bracketed part that is encountered. Here's a basic technique:

      #!/usr/bin/env perl

      use strict;
      use warnings;

      use constant {
          EXTRACTED => 0,
          SUFFIX => 1,
          PREFIX => 2,
      };

      use Text::Balanced 'extract_bracketed';

      my $delim = '([{<';
      my $prefix = qr{[^$delim]*};

      my $string = 'The (use {of}) parentheses (indicates that the (writer [considered] the {information}) less <important—almost> an afterthought).';

      my ($current, $wanted) = ($string, '');

      while (1) {
          my @parts = extract_bracketed($current, $delim, $prefix);

          if (defined $parts[PREFIX]) {
              my ($trimmed_start) = $parts[PREFIX] =~ /^(.*?)\s*$/;
              $wanted .= $trimmed_start;
          }

          if (defined $parts[EXTRACTED]) {
              $current = $parts[SUFFIX];
          }
          else {
              $wanted .= $parts[SUFFIX];
              last;
          }
      }

      print "$wanted\n";

      Output:

      The parentheses.

      I'll leave you to integrate that technique into your actual production code.
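For the "Pattern1removed / Pattern2removed" part of the desired output, the same loop can also collect what it strips. The following is a minimal sketch along those lines; the sub name `strip_bracketed` is hypothetical, and the sample string uses a plain hyphen in `<important-almost>`. It relies on the documented list-context behaviour of `extract_bracketed`: on failure the extracted string is `undef` and the whole input comes back as the remainder.

```perl
use strict;
use warnings;
use Text::Balanced 'extract_bracketed';

# Hypothetical helper: returns the cleaned text followed by the removed groups.
sub strip_bracketed {
    my ($text) = @_;
    my $delim  = '([{<';
    my $prefix = qr{[^$delim]*};
    my ($clean, @removed) = ('');

    while (1) {
        my ($extracted, $remainder, $skipped)
            = extract_bracketed($text, $delim, $prefix);

        if (!defined $extracted) {
            # No (further) balanced group: keep the rest verbatim and stop.
            $clean .= $remainder;
            last;
        }

        $skipped =~ s/\s+$//;      # drop whitespace just before the group
        $clean .= $skipped;
        push @removed, $extracted;
        $text = $remainder;
    }
    return ($clean, @removed);
}

my ($clean, @removed) = strip_bracketed(
    'The (use {of}) parentheses (indicates that the (writer [considered] '
  . 'the {information}) less <important-almost> an afterthought).'
);

print "Desired output:'$clean'\n";
print "Pattern", $_ + 1, "removed:'$removed[$_]'\n" for 0 .. $#removed;
```

Note that the loop stops at the first bracket it cannot balance, so any balanced groups after an unbalanced one are left untouched.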

      [The lack of sample data — as requested by myself and expanded upon by AnomalousMonk — was disappointing.]

      — Ken

        Excellent, thanks much Ken. That helps! Sorry I missed samples, attaching a few here.

        I am starting to learn text analysis (e.g. sentiment analysis on reviews/opinion). Here is the link to the full dataset: http://www.cs.jhu.edu/~mdredze/datasets/sentiment/ . Many are also available on google.

        A few samples are shown below. It seems text processing is far more complex than I realized, with lots of variations and edge cases. I just came across cases where the brackets can't be removed because doing so makes the text pretty awkward, for example: 'Company is good because of: (1) quick service, (2) product range..'.

        #samples
        The largest of these (4 quart{16 cups}) will hold all but a little bit of a bag of flour. I like the look of the stainless steel and the fact of the seal around the lid.
        I {mistakenly} did not recieve my order (it was a different order i was expecting), and they rushed me another collar.
        The rougher the machine the more wear on the fabric and more pilling,lint, etc.... I would highly recommend these sheets for the price and the great quality for the price. [I would rate them 5 stars if they were heavier and soft as cashmere, but for the money definitely a good buy.]
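One possible way to handle the enumerator case mentioned above is to decide per group whether to drop it. The sketch below is an alternative to the Text::Balanced loop, not something posted in this thread: it uses a core-Perl recursive regex (Perl 5.10+), and both the sub name `strip_groups` and the rule "keep groups that look like `(1)` or `(a)`" are assumptions for illustration.

```perl
use strict;
use warnings;
use 5.010;

# Any run of non-bracket characters (possessive, to avoid needless backtracking).
my $plain = qr/[^()\[\]{}<>]++/;

# A balanced group of any bracket type; nested groups recurse into (?&group).
my $balanced = qr/
    (?<group>
        \( (?: $plain | (?&group) )* \)
      | \[ (?: $plain | (?&group) )* \]
      | \{ (?: $plain | (?&group) )* \}
      |  < (?: $plain | (?&group) )*  >
    )
/x;

# Hypothetical helper: remove outermost balanced groups and the whitespace
# before them, but keep groups that look like list enumerators.
sub strip_groups {
    my ($text) = @_;
    $text =~ s{(\s*)($balanced)}{
        my ($space, $group) = ($1, $2);
        $group =~ /^\((?:\d+|[a-z])\)$/ ? $space . $group : '';
    }gex;
    return $text;
}

say strip_groups('I {mistakenly} did not recieve my order (it was a different order i was expecting), and they rushed me another collar.');
say strip_groups('The largest of these (4 quart{16 cups}) will hold all but a little bit of a bag of flour.');
say strip_groups('Company is good because of: (1) quick service, (2) product range..');
```

Unlike the extraction loop, this keeps going past an unbalanced bracket and would still remove balanced groups nested inside it, so strings that should be ignored entirely would need a separate check first. A group at the very start of a string leaves the following space behind.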