in reply to Properly transforming strings with nested markup tags

Here's another approach...
    # Takes a marked-up string, a regular expression that selects the
    # tag(s) we're interested in, and a code reference that does the
    # sub-string transformation. Returns the modified string.
    sub parse_and_replace {
        my ( $string, $tag_match, $transform_sub ) = @_;
        my @context_stack;
        my %deferred_transforms;

        # loop matching tags of the form {word} and {/word}
        while ( $string =~ m!(\{(/)?(\w+)\})!g ) {
            my ( $tag, $tag_length ) = ( $3, length($1) );
            my $is_close = $2 ? 1 : 0;

            if ( $is_close ) {
                # pop and possibly transform on finding a matching close tag
                # syntax check: properly nested?
                my $popped = pop @context_stack;
                if ( !$popped or $tag ne $popped->{tag} ) {
                    my $open = $popped ? $popped->{tag} : '(none)';
                    die "close '$tag' mis-matched with open '$open'\n";
                }
                # save start index and length of the tagged text if the
                # tag matches the $tag_match param
                if ( $tag =~ /$tag_match/ ) {
                    my $start  = $popped->{pos};
                    my $length = pos($string) - $popped->{pos};
                    my $text   = substr( $string, $start, $length );
                    if ( !$deferred_transforms{$text} ) {
                        $deferred_transforms{$text} = $transform_sub->("$text");
                    }
                }
            }
            else {
                # just push onto the context stack on finding an open tag
                push @context_stack,
                    { tag => $tag, pos => pos($string) - $tag_length };
            }
        }

        # now do the replacements, longest text first so that an enclosing
        # span is replaced before the spans nested inside it; \Q...\E keeps
        # the braces and other metacharacters in $text literal
        foreach my $text ( sort { length($b) <=> length($a) }
                           keys %deferred_transforms ) {
            $string =~ s/\Q$text\E/$deferred_transforms{$text}/g;
        }
        return $string;
    }

    # and to invoke:
    my $string = q(
    Outside.
    {tag}
    Inside level 1.
    {tag}
    Inside level 2.
    {/tag}
    Inside level 1.
    {/tag}
    Outside.
    );

    my $sub = sub {
        $_[0] =~ s/\{tag\}(.+)\{\/tag\}/--Marked--\n$1\n--EndMarked--/gis;
        return $_[0];
    };

    print "RESULT: " . parse_and_replace( $string, 'tag', $sub );

I do think it's worth noting that parsing and manipulating recursively-nested markup is not trivial. (You should test the heck out of any custom-written solution -- including the one I just supplied -- before you even start to think about trusting it for your application.) I second Merlyn's advice about rolling modules into your distribution; Parse::RecDescent is powerful, flexible and debugged!

It isn't possible, as you're discovering, to do this kind of parsing with simple regexps. Unless you're willing to put severe limits on allowed markup structure, you'll need to parse recursively (or cheat a bit and save some context, as my code does, and as IO's code does in a much niftier way).
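To make that concrete, here is a minimal sketch (not taken from the code above; the sample string and the helper name max_depth are made up for illustration). A non-greedy match pairs the first open tag with the first close tag, which is wrong as soon as tags nest. A scan that keeps a little context, here nothing more than a depth counter, follows the real structure.

```perl
use strict;

# Hypothetical sample input: one {tag} nested inside another.
my $nested = "{tag}a{tag}b{/tag}c{/tag}";

# Naive attempt: a non-greedy match stops at the FIRST close tag,
# so the outer open tag gets paired with the inner close tag.
my ($naive) = $nested =~ m!\{tag\}(.*?)\{/tag\}!s;

# Keeping context: walk the tags in order and track nesting depth.
# max_depth is an illustrative helper, not part of parse_and_replace.
sub max_depth {
    my ($string) = @_;
    my ( $depth, $max ) = ( 0, 0 );
    while ( $string =~ m!\{(/?)\w+\}!g ) {
        if ($1) {
            $depth--;
            die "close tag without an open tag\n" if $depth < 0;
        }
        else {
            $depth++;
            $max = $depth if $depth > $max;
        }
    }
    die "unclosed tag\n" if $depth != 0;
    return $max;
}

print "naive match: $naive\n";
print "max depth:   ", max_depth($nested), "\n";
```

The naive capture comes back as "a{tag}b", which is not the content of any real element, while the counting scan reports a depth of 2 for the same string.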

And parsing is only half the battle -- the transformation can be tricky, too. Unless you're willing to limit the kind of transformation that's allowed, you have to build a tree, do the transformations on each tree node, then put the tree back together into a string. (My sub above sidesteps tree-ization by limiting transformations to simple, stateless, one-to-one mappings between a given "{tag}content{/tag}" string and a "result" string.)
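For comparison, the tree route might look roughly like the sketch below. It is an illustration only, not the code from this post: parse_tree and render are hypothetical names, the node layout (a hash with tag and kids) is an assumption, and the callback receives the tag name plus the already-transformed inner text.

```perl
use strict;

# Parse {word}...{/word} markup into a tree. Each element node is
# { tag => NAME, kids => [ ... ] }; plain text is stored as a string.
sub parse_tree {
    my ($string) = @_;
    my $root  = { tag => '', kids => [] };
    my @stack = ($root);
    # split with a capturing group keeps the tags in the returned list
    foreach my $piece ( split /(\{\/?\w+\})/, $string ) {
        next if $piece eq '';
        if ( $piece =~ m!^\{/(\w+)\}\z! ) {
            my $node = pop @stack;
            die "close '$1' mis-matched with open '$node->{tag}'\n"
                if !@stack or $node->{tag} ne $1;
        }
        elsif ( $piece =~ m!^\{(\w+)\}\z! ) {
            my $node = { tag => $1, kids => [] };
            push @{ $stack[-1]{kids} }, $node;
            push @stack, $node;
        }
        else {
            push @{ $stack[-1]{kids} }, $piece;
        }
    }
    die "unclosed tag '$stack[-1]{tag}'\n" if @stack > 1;
    return $root;
}

# Rebuild a string bottom-up: children are rendered first, then the
# callback decides what each element node becomes.
sub render {
    my ( $node, $transform ) = @_;
    return $node unless ref $node;          # plain text leaf
    my $inner = join '', map { render( $_, $transform ) } @{ $node->{kids} };
    return $inner if $node->{tag} eq '';    # the synthetic root
    return $transform->( $node->{tag}, $inner );
}

my $tree = parse_tree("Out {tag}one {tag}two{/tag} one{/tag} out");
my $out  = render( $tree, sub { my ( $tag, $inner ) = @_; return "[$inner]" } );
print "$out\n";
```

Because each node is transformed after its children, the callback can be stateful or depend on the tag name, which the string-substitution shortcut above cannot offer.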

Kwin

Re: Re: Properly transforming strings with nested markup tags
by tocie (Novice) on Dec 29, 2001 at 12:09 UTC
    I'm gonna go over the code you posted tomorrow morning... too much for my meager brain to parse error-free at this hour. ;)

    The markup tags are not complex, nor can they be twisted to make themselves complex... they just get completely mangled and nested from time to time.

    Unfortunately, the most immediate and obvious solutions are either ugly or unavailable ... for example, Parse::RecDescent requires 5.005, and we need to run under 5.004.

    Anyway - thank you for your responses, Kwin and everyone!