Dr.Altaica has asked for the wisdom of the Perl Monks concerning the following question:

I want to split a string at the first layer of nesting. For example, take:
"[S [NP This NP] [VP is [NP [NP the turning point NP] [PP to [NP the left NP] PP] NP] VP] . S]"
and split it into:
("[NP This NP]","[VP is [NP [NP the turning point NP] [PP to [NP the left NP] PP] NP] VP]", "." )
I looked at Parse::RecDescent and Text::DelimMatch, but those seem to want some sort of delimiter. I would like to use only what is in the standard distribution of Perl, as the program is a mobile agent. All the ']'s and '['s will be part of the nesting; there won't be any that are escaped or otherwise not part of the nesting.

Replies are listed 'Best First'.
Re: Non deliminatd Nested text
by japhy (Canon) on Oct 25, 2001 at 20:20 UTC
    Here's a nifty function for you. It requires Perl 5.6 to run; it uses a nested regex to match nested things. The second argument to get_brackets() is the depth you want.
    my $str = q{[S [NP This NP] [VP is [NP [PP to [NP the left NP] PP] NP] VP] . S]};

    my $bracket;
    $bracket = qr{
      (?:
        (?> [^][]* )
        |
        \[ (??{ $bracket }) \]
      )+
    }x;

    for (get_brackets($str, 1)) { print "<$_>\n" }
    for (get_brackets($str, 2)) { print "<<$_>>\n" }

    sub get_brackets {
      my ($str, $depth) = @_;
      my @hits = $str;
      @hits = map m{\[($bracket)\]}g, @hits for 1 .. $depth;
      return map "[$_]", @hits;  # put the []'s back in
    }
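
    On Perl 5.10 or later, the same nesting idea can probably be expressed with the built-in (?1) recursion instead of (??{ }). The following is only a sketch of that alternative, not the code above: the names $balanced and groups_at_depth are made up for illustration, and unlike the original question it returns only the bracketed groups (a bare token such as "." is skipped).

    ```perl
    use 5.010;
    use strict;
    use warnings;

    # Assumption: brackets are never escaped, as stated in the question.
    # Group 1 matches one balanced [...] group; (?1) recurses into itself.
    my $balanced = qr{
        (
            \[
            (?: [^\[\]]++ | (?1) )*
            \]
        )
    }x;

    # Hypothetical helper: return the balanced groups found at $depth,
    # where depth 1 is the outermost layer of the input string.
    sub groups_at_depth {
        my ($text, $depth) = @_;
        my @hits = ($text);
        for my $level (1 .. $depth) {
            @hits = map {
                my $s = $_;
                # Peel off the enclosing brackets found on the previous pass.
                $s =~ s/^\[(.*)\]$/$1/s if $level > 1;
                # In list context, /g returns group 1 for every match.
                $s =~ /$balanced/g;
            } @hits;
        }
        return @hits;
    }

    my $str = '[S [NP This NP] [VP is [NP [PP to [NP the left NP] PP] NP] VP] . S]';
    print "<$_>\n" for groups_at_depth($str, 2);
    ```

    Because the possessive [^\[\]]++ never gives characters back, an unbalanced input should fail quickly rather than backtrack heavily.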

    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: Non deliminatd Nested text
by FoxtrotUniform (Prior) on Oct 25, 2001 at 20:53 UTC
Re: Non deliminatd Nested text
by gryphon (Abbot) on Oct 25, 2001 at 21:36 UTC

    Greetings Dr.Altaica,

    You have an interesting situation. I'm sure there must be a module out there that does nearly exactly what you want, but since you specifically asked for something that will work "only in the standard distribution of Perl," here's my code:

    #!/usr/bin/perl -w
    use strict;

    my $input = '[S [NP This NP] [VP is [NP [NP the turning point NP] ' .
                '[PP to [NP the left NP] PP] NP] VP] . S]';

    $input =~ s/^\[S\s*(.*?)\s*S\]$/$1/;

    my $bracket_count = 0;
    my @output;
    my $build_var;

    foreach (split(//, $input)) {
      $build_var .= $_;
      if (/\[/) { $bracket_count++; }
      elsif (/\]/) { $bracket_count--; }
      if ($bracket_count == 0) {
        push @output, $build_var if ($build_var ne ' ');
        $build_var = '';
      }
    }

    foreach (@output) {
      print '"', $_, '"', "\n";
    }

    It's not fancy or nice, and probably will need to be patched a bit for situations that fall outside of your given example. However, it does seem to return the results that you want.

    -gryphon
    code('Perl') || die;

      Just in case, test to see whether $bracket_count ever goes to -1. In the case where your string is aaa]bbb[... you probably want to do something a little more drastic than pushing (a, a, a, ]bbb[) into the output array :)
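
      A minimal sketch of such a check, assuming you simply want to reject unbalanced input before splitting it. The name brackets_balanced and its 1/0 return convention are invented here for illustration.

      ```perl
      use strict;
      use warnings;

      # Hypothetical helper: returns 1 if every ']' closes an earlier '['
      # and nothing is left open at the end of the string, 0 otherwise.
      sub brackets_balanced {
          my ($text) = @_;
          my $depth = 0;
          for my $char (split //, $text) {
              if ($char eq '[') {
                  $depth++;
              }
              elsif ($char eq ']') {
                  $depth--;
                  return 0 if $depth < 0;   # a ']' with no matching '['
              }
          }
          return $depth == 0 ? 1 : 0;
      }

      for my $s ('[S [NP This NP] . S]', 'aaa]bbb[') {
          print "$s: ", (brackets_balanced($s) ? "ok" : "unbalanced"), "\n";
      }
      ```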

      Thanks gryphon. It's not exactly what I wanted, but that was my fault: I wasn't clear about what I wanted. I should have used an example like splitting '[VP This stuff is [NP the left NP] [NP other thing NP] VP]' into ("This stuff is", "[NP the left NP]", "[NP other thing NP]"). Here's the code that works the way I need, in case someone else needs to split an anchor-delimited string (in this case at the boundaries marked by " [" and "] ") and ignore the nested ones.
      #!/usr/bin/perl -w
      use strict;

      #my $input = '[VP is [NP one NP] it [NP two NP] working VP]';
      my $input = '[VP This stuff is [NP the left NP] [NP other thing NP] VP]';
      #my $input = '[VP [NP This NP] [VP is [NP [NP the turning point NP] [PP to [NP the left NP] PP] NP] VP] . VP]';
      #my $input = '[S [NP This NP] [VP is [NP [NP the turning point NP] ' .
      #            '[PP to [NP the left NP] PP] NP] VP] . S]';

      $input =~ s/^\[\w+\s*(.*?)\s*\w+\]$/$1\n/;

      my $bracket_count = 0;
      my @output;
      my $build_var;

      foreach (split(//, $input)) {
        if (/\[/) {
          if ($bracket_count == 0) {                          #DR
            $build_var =~ s/^\s*|\s*$//g;                     #dr
            push @output, $build_var if ($build_var ne '');   #DR
            $build_var = '';                                  #DR
          }                                                   #DR
          $bracket_count++;
          $build_var .= $_;                                   #dr
        }
        elsif (/\]/) {
          $bracket_count--;
          $build_var .= $_;                                   #dr
          if ($bracket_count == 0) {                          #DR
            $build_var =~ s/^\s*|\s*$//g;                     #dr
            push @output, $build_var if ($build_var ne '');   #DR
            $build_var = '';                                  #DR
          }                                                   #DR
        }
        elsif (/\n/) {                                        #dr Should be the end
          $build_var =~ s/^\s*|\s*$//g;                       #dr
          push @output, $build_var if ($build_var ne '');     #DR
        }
        else {                                                #dr
          $build_var .= $_;                                   #dr
        }
      }

      foreach (@output) {
        print '"', $_, '"', "\n";
      }
Re: Non deliminatd Nested text
by hsmyers (Canon) on Oct 26, 2001 at 01:32 UTC
    To demonstrate how to break on more than one delimiter (in this case blanks and curly braces), here is a twist on gryphon's code...

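    sub split_delimited {
      my $input = shift;
      my $limit = shift;
      my @output;
      my $s = '';
      my $t = '';
      my $bracket_count = 0;
      my $build_var = '';

      $input =~ s/\n/ /gm;

      # Break on blanks, but only outside curly braces.
      foreach ( split ( //, $input ) ) {
        $build_var .= $_;
        if (/\{/) { $bracket_count++; }
        elsif (/}/) { $bracket_count--; }
        elsif ( / / and $bracket_count == 0 ) {
          push @output, $build_var if ( $build_var and $build_var ne ' ' );
          $build_var = '';
        }
      }
      # Keep the final chunk, which has no trailing blank to trigger a push.
      push @output, $build_var if ( $build_var ne '' and $build_var ne ' ' );

      # Fill $s until it reaches the limit; everything after that goes to $t.
      foreach (@output) {
        if (length($s) < $limit) { $s .= $_; }
        else { $t .= $_; }
      }
      return $s,$t;
    }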

    This returns two substrings without violating any of the formatting within the parent string, but still solving whatever problems being over the 'limit' would cause.

    As was pointed out, there are a number of module approaches to this problem, but given the constraint you mentioned, I wouldn't hesitate to use this code. In fact I did; thanks, gryphon!

    hsm

    BTW, your example is delimited, but I think I know what you meant…

Re: Non deliminatd Nested text
by I0 (Priest) on Jun 24, 2002 at 13:28 UTC
    $_ = '[S [NP This NP] [VP is [NP [NP the turning point NP] [PP to [NP the left NP] PP] NP] VP] . S]';
    #$_ = '[VP This stuff is [NP the left NP] [NP other thing NP] VP]';
    @( = ('(','');
    @) = (')','');
    ($re=$_)=~s/((\[\w+\s*)|(\s*\w+\])|.)/$)[!$3]\Q$1\E$([!$2]/gs;
    $re = join'|',map{quotemeta}eval{/$re/};
    die $@ if $@=~/unmatched/;
    $_ = (eval{/($re)/})[0];
    print join"\n\n",/\s*(\[\w+\s*(?:$re)\s*\w+\]|[^][]*)/g,"";