bracket processing

rajaman has asked for the wisdom of the Perl Monks concerning the following question:

Re: bracket processing by kcott (Archbishop) on Mar 31, 2020 at 04:09 UTC
G'day rajaman,

"I also tried to work with 'Text::Balanced' module but couldn't make it work."

If you were more specific about what "make it work" means, and also posted the code you tried, we could probably be more helpful. Here are a couple of examples of what you could have done.

    #!/usr/bin/env perl

    use strict;
    use warnings;

    use Text::Balanced 'extract_bracketed';

    my $delim = '([{<';
    my $prefix = qr{[^$delim]*};
    my $string = 'The use of parentheses (indicates that the (writer [considered] the {information}) less <important—almost> an afterthought).';

    my @parts = extract_bracketed($string, $delim, $prefix);

    print " i) $parts[2]$parts[1]\n";
    print "ii) $parts[0]\n";

    my ($trimmed_start) = $parts[2] =~ /^(.*?)\s*$/;

    print " I) $trimmed_start$parts[1]\n";
    print "II) $parts[0]\n";

Output:

     i) The use of parentheses .
    ii) (indicates that the (writer [considered] the {information}) less <important—almost> an afterthought)
     I) The use of parentheses.
    II) (indicates that the (writer [considered] the {information}) less <important—almost> an afterthought)

The first two results (i & ii) are what I believe you asked for but not what your expected output showed: note the extra space at the end (`parentheses .`). The second two results (I & II) show how you might trim that extra space. This looks more like the expected output you show but doesn't exactly follow your description.

It would be helpful if you presented your expected output within `<code>...</code>` tags, so that we can see more clearly exactly what you want (HTML paragraph rendering is not always faithful to the original text).

It would also help if you supplied a range of much shorter input samples, along with your expected output for these. Consider edge cases: no text before the first bracket; no text after the last bracket; unbalanced brackets in various places; and so on.

See also: Text::Balanced.

-- Ken
Re^2: bracket processing by AnomalousMonk (Archbishop) on Mar 31, 2020 at 06:09 UTC
    my $delim = '([{<';
    my $prefix = qr{[^$delim]*};

As a general practice, I find it's much safer to interpolate strings like `$delim` into regexes using `\Q \E` metaquote escapes:

    my $delim = '([{<';
    my $prefix = qr{[^\Q$delim\E]*};

Of course, one could metaquote the string variable upon definition:

    my $delim = quotemeta '([{<';

but that might screw up subsequent use of the string; e.g., its use in something like

    my @parts = extract_bracketed($string, $delim, $prefix);

might become problematic.

Give a man a fish: `<%-{-{-{-<`
Re^3: bracket processing by kcott (Archbishop) on Apr 01, 2020 at 00:37 UTC
"As a general practice, I find it's much safer to interpolate strings ... into regexes using `\Q` `\E` ..." As a general rule, for regexes in general, that's fine and I'd generally do the same; however, bracketed classes are different. Take a look at "perlrecharclass: Special Characters Inside a Bracketed Character Class". I'll leave you to acquaint yourself with the full text. Here's some pertinent extracts (my emphasis added): Most characters that are meta characters in regular expressions ... lose their special meaning and can be used inside a character class without the need to escape them. ... Characters that may carry a special meaning inside a character class are: `\` , `^` , `-` , `[` and `]` , and are discussed below. ... A `[` is not special inside a character class, unless it's the start of a POSIX character class ... It normally does not need escaping. So, none of the characters in `$delim` required escaping. Furthermore, I generally aim to thoroughly test my solutions before posting them. In this instance, I had added a temporary `print` statement: `my $prefix = qr{[^$delim]}; print "$prefix\n";` [download] which output: `(?^:[^([{<])` [download] That's exactly the regex I wanted. — Ken	[reply] [d/l] [select]
Re^4: bracket processing by AnomalousMonk (Archbishop) on Apr 01, 2020 at 18:23 UTC
Re^5: bracket processing by kcott (Archbishop) on Apr 02, 2020 at 01:44 UTC
Re: bracket processing by 1nickt (Canon) on Mar 31, 2020 at 02:04 UTC
Hi,

Just remove the matching string (and whatever's left over) when you find that the RE matches. The matched string will be in `$1` as normal and the original string will be trimmed to just the prefix.

    use strict; use warnings; use 5.010;
    use Regexp::Common 'RE_balanced';

    my $string = 'The use of parentheses (indicates that the '
               . '(writer [considered] the {information}) less '
               . '<important—almost> an afterthought).';

    my $match = RE_balanced( -parens => '(){}[]<>' );

    $string =~ s/${match}.*//;

    $1 and say ">$_<" for $string, $1;

Output:

    $ perl 11114819.pl
    >The use of parentheses <
    >(indicates that the (writer [considered] the {information}) less <important—almost> an afterthought)<

Hope this helps!

The way forward always starts with a minimal test.
Re^2: bracket processing (updated) by AnomalousMonk (Archbishop) on Mar 31, 2020 at 05:48 UTC
Update: Per haukex's query, the version of Regexp::Common::balanced I'm using is 2010010201, so yes, I'm a bit behind the times and the behavior of 1nickt's code is not unexpected.

    c:\@Work\Perl\monks>perl -wMstrict -le
    "use Regexp::Common qw(balanced);
     print $Regexp::Common::balanced::VERSION;
    "
    2010010201

Are you sure about that code? I can only generate the given output when I add the `-keep` switch to the `RE_balanced()` call.

Without `-keep` (nothing is printed):

    c:\@Work\Perl\monks>perl -wMstrict -le
    "use strict; use warnings; use 5.010;
     use Regexp::Common 'RE_balanced';
     ;;
     my $string = 'The use of parentheses (indicates that the '
                . '(writer [considered] the {information}) less '
                . '<importantalmost> an afterthought).';
     ;;
     my $match = RE_balanced( -parens => '(){}[]<>' );
     ;;
     $string =~ s/${match}.*//;
     ;;
     $1 and say \"^>$_^<\" for $string, $1;
    "

With `-keep`:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "use strict; use warnings; use 5.010;
     use Regexp::Common 'RE_balanced';
     ;;
     my $string = 'The use of parentheses (indicates that the '
                . '(writer [considered] the {information}) less '
                . '<importantalmost> an afterthought).';
     ;;
     my $match = RE_balanced( -parens => '(){}[]<>', -keep );
     ;;
     $string =~ s/${match}.*//;
     ;;
     $1 and say \"^>$_^<\" for $string, $1;
    "
    >The use of parentheses <
    >(indicates that the (writer [considered] the {information}) less <importantalmost> an afterthought)<

Give a man a fish: `<%-{-{-{-<`
Re^3: bracket processing by haukex (Archbishop) on Mar 31, 2020 at 07:25 UTC
"I can only generate the given output when I add the `-keep` switch to the `RE_balanced()` call."

What version are you using? From Regexp::Common::balanced:

"Since version 2013030901, `$1` will always be set (to the entire matched substring), regardless whether `{-keep}` is used or not."
Re: bracket processing by LanX (Saint) on Mar 31, 2020 at 01:26 UTC
I'm not sure about your "longest greedy" requirement, or whether it really has to be one single regex.

I'd go for a KISS approach: apply multiple replacements of non-nested pairs with placeholders.

    0: The use of parentheses (indicates that the (writer [considered] the {information}) less <important—almost> an afterthought).
    1: The use of parentheses (indicates that the (writer %0% the %1%) less %2% an afterthought).
    2: The use of parentheses (indicates that the %3% less %2% an afterthought).
    3: The use of parentheses %4%.

This can be done by repeating one simple regex over and over and storing the matches in an array. Afterwards you just need to reconstruct the tree again.

NB: I just used %n% for visualization. Using something like \0 is far better here.

HTH :)

Cheers Rolf
(addicted to the Perl Programming Language :)
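In case it helps, here is a minimal sketch of the placeholder idea described above. It is only one possible reading of it, not Rolf's code: the sub names `strip_balanced` and `expand_placeholders` are made up, the placeholder format (`\0`, index, `\0`) is an assumption, and it presumes the input never contains NUL characters.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: put back every placeholder, innermost ones included.
sub expand_placeholders {
    my ($text, $stash) = @_;
    1 while $text =~ s/\0(\d+)\0/$stash->[$1]/g;
    return $text;
}

# Hypothetical helper: returns the cleaned text, then each outermost group.
sub strip_balanced {
    my ($text) = @_;
    my @stash;    # $stash[n] holds the text that placeholder n replaced

    # One innermost pair: an opener, no bracket characters, its own closer.
    my $pair = qr{ \( [^()\[\]{}<>]* \)
                 | \[ [^()\[\]{}<>]* \]
                 | \{ [^()\[\]{}<>]* \}
                 | <  [^()\[\]{}<>]* >
                 }x;

    # Collapse one pair per pass until no balanced pair is left.
    1 while $text =~ s/($pair)/push @stash, $1; "\0" . $#stash . "\0"/e;

    # Placeholders still visible in $text stand for the outermost groups.
    my @removed = map { expand_placeholders($stash[$_], \@stash) }
                  $text =~ /\0(\d+)\0/g;

    # Drop those placeholders together with the whitespace before them.
    (my $clean = $text) =~ s/\s*\0\d+\0//g;

    return ($clean, @removed);
}

my ($clean, @removed) =
    strip_balanced('The (use {of}) parentheses (show [nested] <groups>).');
print "clean:   '$clean'\n";
print "removed: '$_'\n" for @removed;
```

Unbalanced brackets never match the pair pattern, so they are simply left in the text. A stray `<` or `>` used as a comparison sign is treated the same way, which may or may not be what is wanted for noisy input.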
Re: bracket processing by AnomalousMonk (Archbishop) on Mar 31, 2020 at 06:21 UTC
rajaman: Further to this point in kcott's post:

"It would also help if you supplied a range of much shorter input samples, along with your expected output for these. Consider edge cases: no text before the first bracket; no text after the last bracket; unbalanced brackets in various places; and so on."

The article How to ask better questions using Test::More and sample data is a very useful elaboration on this approach. Short, Self-Contained, Correct Example is a good read also.

Give a man a fish: `<%-{-{-{-<`
Re: bracket processing by rajaman (Sexton) on Mar 31, 2020 at 20:56 UTC
Thank you Monks, great points all! As suggested by Ken and AnomalousMonk, let me elaborate on the problem.

The overall goal is to remove all types of brackets from 'noisy' text (e.g. HTML content, tweets, etc.) and thereby 'sanitize' the text. The brackets may appear in the text in any number and in any form (edge cases). The idea is to remove the content of all non-overlapping, longest-extending, balanced brackets regardless of their types. Strings that have unbalanced brackets can be ignored. Characters flanking the brackets may be any of (\s or \. or \; or \: or \,).

The script below, based on Ken's and AnomalousMonk's suggestions, works well if there is just one 'big' bracket.

Program (program.pl):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Text::Balanced 'extract_bracketed';

    my $delim = '([{<';
    my $prefix = qr{[^\Q$delim\E]*};
    my $string = 'The use of parentheses (indicates that the (writer [considered] the {information}) less <important—almost> an afterthought).';

    my @parts = extract_bracketed($string, $delim, $prefix);
    $parts[2] =~ s/\s*$//;

    print "pattern:'$parts[0]'\n";
    print "rightside of pattern:'$parts[1]'\n";
    print "leftside of pattern:'$parts[2]'\n";

Output:

    pattern:'(indicates that the (writer [considered] the {information}) less <important—almost> an afterthought)'
    rightside of pattern:'.'
    leftside of pattern:'The use of parentheses'

However, when another non-overlapping bracket appears in the string (e.g. '(use {of})'), the above script removes just one:

    $string = 'The (use {of}) parentheses (indicates that the (writer [considered] the {information}) less <important—almost> an afterthought).';

Output:

    pattern:'(use {of})'
    rightside of pattern:' parentheses (indicates that the (writer [considered] the {information}) less <important—almost> an afterthought).'
    leftside of pattern:'The'

The desired output, though, should have the second bracket removed as well, something along these lines:

    Desired output:'The parentheses.'
    Pattern1removed:'(use {of})'
    Pattern2removed:'(indicates that the (writer [considered] the {information}) less <important—almost> an afterthought)'

How can such cases be addressed? Thanks again for your help!
Re^2: bracket processing by kcott (Archbishop) on Apr 01, 2020 at 02:31 UTC
You could just use a loop, dealing with each bracketed part that is encountered. Here's a basic technique:

    #!/usr/bin/env perl

    use strict;
    use warnings;
    use constant {
        EXTRACTED => 0,
        SUFFIX => 1,
        PREFIX => 2,
    };

    use Text::Balanced 'extract_bracketed';

    my $delim = '([{<';
    my $prefix = qr{[^$delim]*};
    my $string = 'The (use {of}) parentheses (indicates that the (writer [considered] the {information}) less <important—almost> an afterthought).';

    my ($current, $wanted) = ($string, '');

    while (1) {
        my @parts = extract_bracketed($current, $delim, $prefix);

        if (defined $parts[PREFIX]) {
            my ($trimmed_start) = $parts[PREFIX] =~ /^(.*?)\s*$/;
            $wanted .= $trimmed_start;
        }

        if (defined $parts[EXTRACTED]) {
            $current = $parts[SUFFIX];
        }
        else {
            $wanted .= $parts[SUFFIX];
            last;
        }
    }

    print "$wanted\n";

Output:

    The parentheses.

I'll leave you to integrate that technique into your actual production code.

[The lack of sample data, as requested by myself and expanded upon by AnomalousMonk, was disappointing.]

-- Ken
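Building on the loop above, here is a sketch of a variant that also records each removed group, since the question asked for those as well ("Pattern1removed", "Pattern2removed"). The sub name `remove_bracketed` and the shortened sample string are made up for illustration. It relies on the list-context return order documented for `extract_bracketed` (extracted text, remainder, skipped prefix), the same order the constants in the loop above use.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Text::Balanced 'extract_bracketed';

# Hypothetical helper: returns the cleaned text, then each removed group.
sub remove_bracketed {
    my ($text) = @_;
    my $delim  = '([{<';
    my $prefix = qr{[^$delim]*};
    my $clean  = '';
    my @removed;

    while (1) {
        my ($extracted, $rest, $skipped)
            = extract_bracketed($text, $delim, $prefix);

        if (!defined $extracted) {
            # No further balanced group: keep whatever text is left.
            $clean .= $rest;
            last;
        }

        $skipped = '' unless defined $skipped;
        $skipped =~ s/\s+\z//;    # drop the space that preceded the bracket
        $clean  .= $skipped;
        push @removed, $extracted;
        $text = $rest;            # carry on after the removed group
    }

    return ($clean, @removed);
}

my ($clean, @removed) =
    remove_bracketed('The (use {of}) parentheses (show [nested] <groups>).');
print "Desired output:'$clean'\n";
print "Pattern", $_ + 1, "removed:'$removed[$_]'\n" for 0 .. $#removed;
```

One limitation to be aware of: as soon as an extraction fails (for example at an unbalanced opening bracket), the rest of the string is kept as-is, so balanced groups that come after that point are not removed.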
Re^3: bracket processing by rajaman (Sexton) on Apr 02, 2020 at 02:16 UTC
Excellent, thanks much Ken. That helps! Sorry I missed the samples; I'm attaching a few here.

I am starting to learn text analysis (e.g. sentiment analysis on reviews/opinions). Here is the link to the full dataset: http://www.cs.jhu.edu/~mdredze/datasets/sentiment/ . Many are also available on Google. A few samples are shown below.

It seems text processing is far more complex than I realized, with lots of variations and edge cases. I just came across cases where brackets can't be removed because doing so makes the text pretty awkward, something like: 'Company is good because of: (1) quick service, (2) product range..'.

    #samples
    The largest of these (4 quart{16 cups}) will hold all but a little bit of a bag of flour. I like the look of the stainless steel and the fact of the seal around the lid.

    I {mistakenly} did not recieve my order (it was a different order i was expecting), and they rushed me another collar.

    The rougher the machine the more wear on the fabric and more pilling, lint, etc.... I would highly recommend these sheets for the price and the great quality for the price. [I would rate them 5 stars if they were heavier and soft as cashmere, but for the money definitely a good buy.]
Re^4: bracket processing by AnomalousMonk (Archbishop) on Apr 02, 2020 at 04:09 UTC
Re^5: bracket processing by rajaman (Sexton) on Apr 05, 2020 at 21:00 UTC