tocie has asked for the wisdom of the Perl Monks concerning the following question:

Example code:
my $string = q(
Outside.
{tag}
Inside level 1.
{tag}
Inside level 2.
{/tag}
Inside level 1.
{/tag}
Outside.
);

$string =~ s/\{tag\}(.+)\{\/tag\}/--Marked--\n$1\n--EndMarked--/gis;
Right, so that matches the first {tag} and the last {/tag}, but nothing in between. If I change the pattern to
/\{tag\}(.+?)\{\/tag\}/
it matches the first {tag} and the first {/tag}, but because the second {tag} was inside the block just matched, it's passed over when seeking the next {tag}.
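
To make the two behaviours concrete, here is a minimal sketch on a shortened, single-line version of the sample input. The variable names `$greedy` and `$lazy` are illustrative only and do not come from the original post.

```perl
use strict;

# Shortened stand-in for the sample input: one pair nested inside another.
my $string = '{tag}L1 {tag}L2{/tag} L1{/tag}';

# Greedy: (.+) runs to the LAST {/tag}, so the inner pair ends up inside $1.
my ($greedy) = $string =~ /\{tag\}(.+)\{\/tag\}/s;

# Non-greedy: (.+?) stops at the FIRST {/tag}, which pairs the outer
# opening tag with the inner closing tag.
my ($lazy) = $string =~ /\{tag\}(.+?)\{\/tag\}/s;

print "greedy: $greedy\n";   # greedy: L1 {tag}L2{/tag} L1
print "lazy:   $lazy\n";     # lazy:   L1 {tag}L2
```

Neither capture corresponds to a single well-formed inner block, which is why one substitution pass cannot handle both levels.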

Le sigh.

The only way I can find to get this thing working as desired is to wrap the pattern match in a while loop:
while($string =~ s/\{tag\}(.+)\{\/tag\}/--Marked--\n$1\n--EndMarked--/gis) { next; }
That's a sub-optimal solution for the problem at hand, where there are dozens of these specialized markup tags.
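
One possible variation on the loop, sketched under assumptions: the content between the tags is forbidden from containing another opening tag, so only an innermost pair can match on each pass, and the loop repeats until nothing is left. This also copes with two blocks sitting side by side, which the greedy loop would pair as first-open with last-close. The `%marker` table and the sub name `mark_innermost_first` are hypothetical, tag names are assumed to be plain word characters, and negative lookahead is assumed to be available in the target Perl.

```perl
use strict;

# Hypothetical table: tag name => marker label. Add one entry per tag.
my %marker = (tag => 'Marked');

sub mark_innermost_first {
    my ($string) = @_;
    foreach my $name (keys %marker) {
        my $label = $marker{$name};
        # (?!\{$name\}) refuses to step over another opening tag, so only
        # an innermost pair can match. Repeat until no pair is left.
        1 while $string =~
            s/\{$name\}((?:(?!\{$name\}).)*?)\{\/$name\}/--${label}--\n$1\n--End${label}--/gis;
    }
    return $string;
}

my $result = mark_innermost_first("{tag}a{/tag} x {tag}b {tag}c{/tag}{/tag}");
print $result, "\n";
```

Each pass removes the tags it matched, so blocks that were one level up become innermost on the next pass. The cost is one extra pass per nesting level per tag.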

Is there a better/quicker/easier way to get this done? Use of *ANY* module not standard in 5.004 is out of the question due to the needs of the app's audience. (I was looking at Parse::RecDescent...)

I thank you for your wisdom.

Replies are listed 'Best First'.
Re: Properly transforming strings with nested markup tags
by merlyn (Sage) on Dec 29, 2001 at 08:36 UTC
    Use of *ANY* module not standard in 5.004 is out of the question due to the needs of the app's audience.
    Well, nearly all pure-Perl modules work with 5.004, so you can just include them with your app as part of the install. See "No excuses about not using CGI.pm" for strategies on making it one file even.

    -- Randal L. Schwartz, Perl hacker
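
A minimal sketch of the bundling idea, assuming the pure-Perl modules are shipped in a directory named `lib` next to the script. That directory name is hypothetical, and whether `FindBin` is present on a given target installation is worth checking rather than assuming.

```perl
#!/usr/bin/perl -w
use strict;

# Assumption: bundled pure-Perl modules live in "lib" beside this script.
use FindBin qw($Bin);
use lib "$Bin/lib";

# Modules placed under lib/ can now be loaded with an ordinary "use" or
# "require"; the bundled directory is searched ahead of the system paths.
my $bundled = grep { $_ eq "$Bin/lib" } @INC;
print "bundled directory is on \@INC\n" if $bundled;
```

This does not help when a module needs a newer interpreter than the target provides, which is the limitation raised in the follow-up.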

      Well, we're already packaging our own CGI.pm, so... ;)

      Unfortunately, it's the 5.004 bit that's the limit, not the lack of modules. Parse::RecDescent, for instance, requires 5.005.
Re: Properly transforming strings with nested markup tags
by I0 (Priest) on Dec 29, 2001 at 10:09 UTC
    my $string = q(
    Outside.
    {tag}
    Inside level 1.
    {tag}
    Inside level 2.
    {/tag}
    Inside level 1.
    {/tag}
    Outside.
    );
    (my $re=$string)=~s/((\{tag\})|(\{\/tag\})|.)/${[')','']}[!$3]\Q$1\E${['(','']}[!$2]/gs;
    @$ = (eval{$string=~/$re/});
    die $@ if $@=~/unmatched/;
    $re = join("|",map{quotemeta}@$);
    print $string while $string=~s/\{tag\}($re)\{\/tag\}/--Marked--\n$1\n--EndMarked--/;
      Wow.

      Er, I think. Thank you. I'll go over that when my eyes focus. ;)
Re: Properly transforming strings with nested markup tags
by khkramer (Scribe) on Dec 29, 2001 at 10:54 UTC
    Here's another approach...
    # takes a marked-up string, a regular expression to determine which
    # tag(s) we're interested in, and a code reference which will do the
    # sub-string transformation. returns the modified string.
    sub parse_and_replace {
      my ( $string, $tag_match, $transform_sub ) = @_;
      my @context_stack;
      my %deferred_transforms;

      # loop matching tags of the form {word} and {/word}
      while ( $string =~ m!(\{(/)?(\w+)\})!g ) {
        my ( $tag, $tag_length ) = ( $3, length($1) );
        my $is_close = $2 ? 1 : 0;
        if ( $is_close ) {
          # pop and possibly transform on finding a matching close tag

          # syntax check: properly nested?
          my $popped = pop @context_stack;
          if ( $tag ne $popped->{tag} ) {
            die "close '$tag' mis-matched with open '$popped->{tag}'\n";
          }
          # save start index and length of tag content if we match the
          # tag_to_match param.
          if ( $tag =~ /$tag_match/ ) {
            my $start  = $popped->{pos};
            my $length = pos($string) - $popped->{pos};
            my $text   = substr ( $string, $start, $length );
            if ( ! $deferred_transforms{$text} ) {
              $deferred_transforms{$text} = $transform_sub->("$text");
            }
          }
        } else {
          # just push onto the context stack on finding an open tag
          push @context_stack, { tag => $tag, pos => pos($string) - $tag_length };
        }
      }

      # now do the replacements (\Q...\E keeps the braces and dots in the
      # saved text from being treated as regex syntax)
      my $error;
      foreach my $text ( keys %deferred_transforms ) {
        $string =~ s/\Q$text\E/$deferred_transforms{$text}/g;
      }
      return $string;
    }

    # and to invoke:
    my $string = q(
    Outside.
    {tag}
    Inside level 1.
    {tag}
    Inside level 2.
    {/tag}
    Inside level 1.
    {/tag}
    Outside.
    );

    my $sub = sub {
      $_[0] =~ s/\{tag\}(.+)\{\/tag\}/--Marked--\n$1\n--EndMarked--/gis;
      return $_[0];
    };

    print "RESULT: " . parse_and_replace ( $string, 'tag', $sub );

    I do think it's worth noting that parsing/manipulating recursively-nested markup is not trivial. (You should test the heck out of any custom-written solution -- including the one I just supplied -- before you even start to think about trusting it for your application.) I second merlyn's advice about rolling modules into your distribution; Parse::RecDescent is powerful, flexible and debugged!

    It isn't possible, as you're discovering, to do this kind of parsing with simple regexps. Unless you're willing to put severe limits on allowed markup structure, you'll need to parse recursively (or cheat a bit and save some context, as my code does, and as I0's code does in a much niftier way).
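
As a rough illustration of "saving some context", the following sketch walks the string once and keeps a stack of open tag names so that mis-nested markup is reported instead of silently mangled. The sub name `mark_nested` and its label table are hypothetical. Unlike the substitution in the question, it does not add newlines around the content.

```perl
use strict;

# Hypothetical helper: replace {name}/{/name} with markers in one pass,
# using a stack to verify that the tags are properly nested.
sub mark_nested {
    my ($string, %label) = @_;    # %label maps tag name => marker text
    my @stack;                    # names of the currently open tags
    my $out = '';

    # split with a capturing group keeps the tags in the returned list
    foreach my $piece (split /(\{\/?\w+\})/, $string) {
        if ($piece =~ /^\{(\/?)(\w+)\}$/ && exists $label{$2}) {
            my ($closing, $name) = ($1, $2);
            if ($closing) {
                my $open = pop @stack;
                die "unexpected closing tag '$name'\n"
                    unless defined $open && $open eq $name;
                $out .= '--End' . $label{$name} . '--';
            } else {
                push @stack, $name;
                $out .= '--' . $label{$name} . '--';
            }
        } else {
            $out .= $piece;       # plain text or an unknown tag: copy through
        }
    }
    die "unclosed tag '$stack[-1]'\n" if @stack;
    return $out;
}

my $marked = mark_nested("a {tag}b {tag}c{/tag} d{/tag} e", tag => 'Marked');
print $marked, "\n";   # a --Marked--b --Marked--c--EndMarked-- d--EndMarked-- e
```

The stack holds only tag names here; a transformation that needs the enclosed text would also have to record positions or build a tree.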

    And parsing is only half the battle -- the transformation can be tricky, too. Unless you're willing to limit the kind of transformation that's allowed, you have to build a tree, do the transformations on each tree node, then put the tree back together into a string. (My sub above sidesteps tree-ization by limiting transformations to simple, stateless, one-to-one mappings between a given "{tag}content{/tag}" string and a "result" string.)
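
The build-a-tree route might look roughly like the sketch below. The node layout (a hash with `tag` and `kids`), the sub names `build_tree` and `render`, and the callback table are all assumptions made for illustration, not anything taken from the posts above.

```perl
use strict;

# Parse into a tree. Each node is { tag => NAME, kids => [ ... ] }, where
# kids holds plain strings and child nodes. The root has an empty tag name.
sub build_tree {
    my ($string) = @_;
    my $root  = { tag => '', kids => [] };
    my @stack = ($root);
    foreach my $piece (split /(\{\/?\w+\})/, $string) {
        if ($piece =~ /^\{(\w+)\}$/) {              # open tag: descend
            my $node = { tag => $1, kids => [] };
            push @{ $stack[-1]{kids} }, $node;
            push @stack, $node;
        } elsif ($piece =~ /^\{\/(\w+)\}$/) {       # close tag: ascend
            die "mismatched closing tag '$1'\n" unless $stack[-1]{tag} eq $1;
            pop @stack;
        } elsif (length $piece) {
            push @{ $stack[-1]{kids} }, $piece;     # text leaf
        }
    }
    die "unclosed tag '" . $stack[-1]{tag} . "'\n" if @stack > 1;
    return $root;
}

# Serialize the tree. Children are rendered first, then the node's own
# callback (if any) wraps the result; tags without a callback are kept.
sub render {
    my ($node, $transform) = @_;
    my $inner = join '', map { ref($_) ? render($_, $transform) : $_ }
                         @{ $node->{kids} };
    my $code = $transform->{ $node->{tag} };
    return $code->($inner) if $code;
    return $inner if $node->{tag} eq '';
    return '{' . $node->{tag} . '}' . $inner . '{/' . $node->{tag} . '}';
}

my %transform = (tag => sub { "--Marked--$_[0]--EndMarked--" });
my $rendered  = render(build_tree("a {tag}b {tag}c{/tag} d{/tag} e"), \%transform);
print $rendered, "\n";
```

Because each callback receives its already-rendered content, a transformation can inspect or rewrite the inner text rather than being limited to a fixed one-to-one mapping.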

    Kwin
      I'm gonna go over the code you posted tomorrow morning... too much for my meager brain to parse error-free at this hour. ;)

      The markup tags are not complex, nor can they be twisted to make themselves complex... they just get completely mangled and nested from time to time.

      Unfortunately, the most immediate and obvious solutions are either ugly or unavailable ... i.e., Parse::RecDescent requires 5.005, when we need to run under 5.004.

      Anyway - thank you for your responses, Kwin and everyone!
(crazyinsomniac) Re: Properly transforming strings with nested markup tags
by crazyinsomniac (Prior) on Dec 29, 2001 at 17:22 UTC
    I did something like this before, and I really like to split. This may or may not suit your needs, but it's the way I'd approach it without Parse::RecDescent (which I really like). It's your stuff first, then mine.
    #!/usr/bin/perl -wl
    ## note the -l flag
    use strict;

    ## your stuff
    my $string = q(
    Outside.
    {tag}
    Inside level 1.
    {tag}
    Inside level 2.
    {/tag}
    Inside level 1.
    {/tag}
    Outside.
    );

    ## original input
    print $string;

    while($string =~ s/\{tag\}(.+)\{\/tag\}/--Marked--\n$1\n--EndMarked--/gis) { next; }

    ## transformed output
    print $string;

    ## my stuff
    print "#" x 69;

    my %tags = ( 'Tagged' => 'tag',
                 'Bagged' => 'bag',
    );

    $string = q(
    Outside.
    {tag}
    Inside level 1.
    {tag}
    Inside level 2.
    {bag}
    bag level 1, inside tag level 2
    {/bag}
    {/tag}
    Inside level 1.
    {/tag}
    Outside.
    );

    my $Sfactor = join '|', map {quotemeta} map {("{$_}","{/$_}")} values %tags;

    my @thing = split m/($Sfactor)/, $string;

    for my $t(@thing) {
        for my $tag(keys %tags) {
            $Sfactor = quotemeta "{$tags{$tag}}";
            $t =~ s/$Sfactor/--$tag--/;
            $Sfactor = quotemeta "{/$tags{$tag}}";
            $t =~ s/$Sfactor/--End$tag--/;
        }
    }

    $string = join '', @thing; ## back as a string

    print $string;

    __END__
    F:\dev\>perl markup.pl
    Outside.
    {tag}
    Inside level 1.
    {tag}
    Inside level 2.
    {/tag}
    Inside level 1.
    {/tag}
    Outside.
    Outside.
    --Marked--
    Inside level 1.
    --Marked--
    Inside level 2.
    --EndMarked--
    Inside level 1.
    --EndMarked--
    Outside.
    #####################################################################
    Outside.
    --Tagged--
    Inside level 1.
    --Tagged--
    Inside level 2.
    --Bagged--
    bag level 1, inside tag level 2
    --EndBagged--
    --EndTagged--
    Inside level 1.
    --EndTagged--
    Outside.
    F:\dev\>
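
Because each tag maps to a fixed marker regardless of nesting depth, the same per-tag replacement can also be sketched as a single substitution with a lookup table. This is a different technique from the split-based code, not a restatement of it; `%label` is the inverse of the `%tags` table, and the helper name `marker` is hypothetical.

```perl
use strict;

# Tag name => marker label (the %tags table turned around).
my %label = (tag => 'Tagged', bag => 'Bagged');

# Build the replacement for one tag; unknown tags are returned unchanged.
sub marker {
    my ($slash, $name) = @_;
    return '{' . $slash . $name . '}' unless exists $label{$name};
    return '--' . ($slash ? 'End' : '') . $label{$name} . '--';
}

my $string = "Outside. {tag} one {bag} two {/bag} {/tag} {other} Outside.";

# One pass over every {name} or {/name} in the string.
$string =~ s!\{(/?)(\w+)\}!marker($1, $2)!ge;

print "$string\n";
# Outside. --Tagged-- one --Bagged-- two --EndBagged-- --EndTagged-- {other} Outside.
```

Like the split version, this does no nesting check, so mismatched tags pass through as mismatched markers.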

     
    ___crazyinsomniac_______________________________________
    Disclaimer: Don't blame. It came from inside the void

    perl -e "$q=$_;map({chr unpack qq;H*;,$_}split(q;;,q*H*));print;$q/$q;"