in reply to bracket processing

G'day rajaman,

"I also tried to work with 'Text::Balanced' module but couldn't make it work."

If you were more specific about what "make it work" means, and also posted the code you tried, we could probably be more helpful. Here are a couple of examples of what you could have done.

    #!/usr/bin/env perl

    use strict;
    use warnings;

    use Text::Balanced 'extract_bracketed';

    my $delim = '([{<';
    my $prefix = qr{[^$delim]*};

    my $string = 'The use of parentheses (indicates that the (writer [considered] the {information}) less <important—almost> an afterthought).';

    my @parts = extract_bracketed($string, $delim, $prefix);

    print " i) $parts[2]$parts[1]\n";
    print "ii) $parts[0]\n";

    my ($trimmed_start) = $parts[2] =~ /^(.*?)\s*$/;

    print " I) $trimmed_start$parts[1]\n";
    print "II) $parts[0]\n";

Output:

     i) The use of parentheses .
    ii) (indicates that the (writer [considered] the {information}) less <important—almost> an afterthought)
     I) The use of parentheses.
    II) (indicates that the (writer [considered] the {information}) less <important—almost> an afterthought)

The first two results (i & ii) are what I believe you asked for but not what your expected output showed: note the extra space at the end (parentheses .).

The second two results (I & II) show how you might trim that extra space. This looks more like the expected output you show but doesn't exactly follow your description.

It would be helpful if you presented your expected output within <code>...</code> tags, so that we can see more clearly exactly what you want (HTML paragraph rendering is not always faithful to the original text).

It would also help if you supplied a range of much shorter input samples, along with your expected output for these. Consider edge cases: no text before the first bracket; no text after the last bracket; unbalanced brackets in various places; and so on.
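
As a rough sketch of what such a set of short samples might look like, the script below runs the same extract_bracketed() call over a handful of edge cases. The first_bracketed() helper and the sample strings are made up for illustration; they are not taken from the original question.

```perl
use strict;
use warnings;
use Text::Balanced 'extract_bracketed';

my $delim  = '([{<';
my $prefix = qr{[^$delim]*};

# Hypothetical helper: returns (extracted, remainder, skipped prefix).
# The extracted element is undef when no balanced bracket is found.
sub first_bracketed {
    my ($text) = @_;
    return extract_bracketed($text, $delim, $prefix);
}

# Hypothetical inputs, one per edge case.
my @samples = (
    '(a) tail',      # no text before the first bracket
    'head (a)',      # no text after the last bracket
    'head (a tail',  # unbalanced: missing closing bracket
    'no brackets',   # nothing to extract
);

for my $sample (@samples) {
    my ($extracted) = first_bracketed($sample);
    printf "%-16s => %s\n", "'$sample'",
        defined $extracted ? "'$extracted'" : 'no match';
}
```

Posting a table like this alongside the output you expect for each row would make it much easier to see where your description and your expected output disagree.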

See also: Text::Balanced.

— Ken

Re^2: bracket processing
by AnomalousMonk (Archbishop) on Mar 31, 2020 at 06:09 UTC
    my $delim = '([{<';
    my $prefix = qr{[^$delim]*};

    As a general practice, I find it's much safer to interpolate strings like  $delim into regexes using  \Q \E metaquote escapes:
        my $delim = '([{<';
        my $prefix = qr{[^\Q$delim\E]*};
    Of course, one could metaquote the string variable upon definition:
        my $delim = quotemeta '([{<';
    but that might screw up subsequent use of the string; e.g., its use in something like
        my @parts = extract_bracketed($string, $delim, $prefix);
    might become problematic.
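
To make the difference concrete, here is a minimal sketch of what quotemeta does to the delimiter string itself. It only inspects the string; whether extract_bracketed() tolerates the extra backslashes is not tested here.

```perl
use strict;
use warnings;

my $delim  = '([{<';
my $quoted = quotemeta $delim;

# The quoted copy has a backslash in front of every bracket,
# so it is twice as long as the original string.
print "original: $delim (length ", length($delim), ")\n";
print "quoted:   $quoted (length ", length($quoted), ")\n";
```

Using \Q \E only at the point of interpolation leaves $delim itself untouched for any later use.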


    Give a man a fish:  <%-{-{-{-<

      "As a general practice, I find it's much safer to interpolate strings ... into regexes using \Q \E ..."

      As a general rule, for regexes in general, that's fine and I'd generally do the same; however, bracketed classes are different.

      Take a look at "perlrecharclass: Special Characters Inside a Bracketed Character Class". I'll leave you to acquaint yourself with the full text. Here are some pertinent extracts (my emphasis added):

      Most characters that are meta characters in regular expressions ... lose their special meaning and can be used inside a character class without the need to escape them.
      ...
      Characters that may carry a special meaning inside a character class are: \ , ^ , - , [ and ] , and are discussed below.
      ...
      A [ is not special inside a character class, unless it's the start of a POSIX character class ... It normally does not need escaping.

      So, none of the characters in $delim required escaping.

      Furthermore, I generally aim to thoroughly test my solutions before posting them. In this instance, I had added a temporary print statement:

      my $prefix = qr{[^$delim]*};
      print "$prefix\n";

      which output:

      (?^:[^([{<]*)

      That's exactly the regex I wanted.
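
A quick way to check this claim is to compare the unescaped and the \Q \E versions of the class side by side. This is only a sketch: the sample strings are hypothetical, and it covers just this particular value of $delim.

```perl
use strict;
use warnings;

my $delim  = '([{<';
my $plain  = qr{[^$delim]*};       # as posted
my $quoted = qr{[^\Q$delim\E]*};   # with metaquoting

# Hypothetical sample inputs, chosen only for illustration.
my @samples = ('text (a)', 'x [y] z', '{only}', 'a-b <c>', 'no brackets');

my $mismatches = 0;
for my $sample (@samples) {
    my ($p) = $sample =~ /^($plain)/;
    my ($q) = $sample =~ /^($quoted)/;
    $mismatches++ if $p ne $q;
    print "'$sample' => plain '$p', quoted '$q'\n";
}
print "mismatches: $mismatches\n";
```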

      — Ken

        ... for regexes in general, that's fine ... bracketed classes are different. ... Most characters that are meta characters in regular expressions ... lose their special meaning ... none of the characters in $delim required escaping.

        In a character class, no one can hear your metacharacters scream. For the most part. The consequences of the occasional exception are what I seek to avoid with defensive measures like this. The effects of a change from
            my $delim = '([{<';
        to
            my $delim = '(-[{<';
        may not be readily apparent, yet still be very significant. One would hope that thorough testing would reveal a problem like this, but better IMHO to obviate the problem to begin with.
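
A small sketch of that failure mode, using a made-up sample string: once a hyphen sits between '(' and '[', the unescaped class treats '(-[' as the range 0x28 to 0x5B, which takes in the digits and the uppercase letters.

```perl
use strict;
use warnings;

my $delim  = '(-[{<';
my $string = 'see Section 2 (below)';   # hypothetical input

my $raw    = qr{[^$delim]*};       # '(-[' is read as a range
my $quoted = qr{[^\Q$delim\E]*};   # every character is literal

my ($raw_prefix)    = $string =~ /^($raw)/;
my ($quoted_prefix) = $string =~ /^($quoted)/;

# The raw class stops at the first uppercase letter;
# the quoted class runs up to the opening parenthesis.
print "raw:    '$raw_prefix'\n";
print "quoted: '$quoted_prefix'\n";
```

The raw version compiles without complaint, which is why a change like this can slip through unnoticed.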


        Give a man a fish:  <%-{-{-{-<