kaatunut has asked for the wisdom of the Perl Monks concerning the following question:

Trivialized problem:

Convert string so that every 'a' between "<bb>...</bb>" is turned into string "bb" and everything between "<U>...</U>" is uppercased.

"aa<bb>aa<U>aa</bb>aa</U>aa" => "aabbbbBBBBAAaa"

Approach problem: I could first go through all the </?bb> tags, apply the conversions between them, and then kill the tags. But that sounds inefficient with longer strings, and this is a trivialization anyway, so I'll just say I want to do this with only one run through the string, so that the data given to the processing routine is something like:

"aaaaaaaaaa", b from offset 2 to 6, U from offset 4 to 8

This is where I have a problem: when I apply one conversion, it might invalidate existing offsets inside the string. For example, if I apply the "b" region first, the string becomes "aabbbbbbbbaaaa", so the 'U' region begins after the 2nd 'b' when it should begin after the 4th 'b'.
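
To make the offset problem concrete, here is a minimal sketch that strips the tags from the example string, records each region as plain-text offsets, and then applies the "b" region in place. The helper name extract_regions is made up for illustration, and the sketch assumes well-formed input in which a tag is never nested inside itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a tagged string into its plain text plus a list
# of [tag, start, end] regions, with offsets counted in the plain text.
# Assumes every tag is opened before it is closed and never nests in itself.
sub extract_regions {
    my ($tagged) = @_;
    my $plain = '';
    my (%open, @regions);
    for my $piece (split m{(</?\w+>)}, $tagged) {
        if ($piece =~ m{^<(/?)(\w+)>$}) {
            my ($closing, $tag) = ($1, $2);
            if ($closing) {
                push @regions, [$tag, delete $open{$tag}, length $plain];
            }
            else {
                $open{$tag} = length $plain;
            }
        }
        else {
            $plain .= $piece;
        }
    }
    return ($plain, @regions);
}

my ($plain, @regions) = extract_regions("aa<bb>aa<U>aa</bb>aa</U>aa");
print "$plain\n";
print "$_->[0] from $_->[1] to $_->[2]\n" for @regions;

# Apply the first recorded region (bb) in place. The string grows, but the
# offsets stored for U are not adjusted, so they now point at the wrong text.
my (undef, $start, $end) = @{ $regions[0] };
(my $inner = substr($plain, $start, $end - $start)) =~ s/a/bb/g;
substr($plain, $start, $end - $start) = $inner;
print "$plain\n";
```

After the in-place edit, the U region still claims offsets 4 to 8, which is exactly the stale-offset situation described above.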

I can avoid this sometimes by applying the last conversions first, but if the two regions overlap, this isn't feasible.

You see, with more complicated rules, the real location of the 'U' tag becomes quite unpredictable.

What I'd really like would be some sort of magical null byte that didn't show up in the actual representation of the string but hung around inside it, moving around as is proper when characters are inserted and deleted. In this spirit I could, of course, temporarily add something like '<foo>' to the string as this hang-around tag and remove it afterwards, but that's a bit unclean (not to mention dangerous/tricky; what if <foo> is already there?)


Bottom line: if you can think of a way to fix the problems with my initial implementation idea, or come up with a better one for the problem above, one that doesn't iterate through the actual tags inside the string many times, I'd be one happy initiate.

Re: Metatag processing (overlapping regions)
by clemburg (Curate) on Nov 11, 2000 at 23:40 UTC

    I don't think there is a single 'correct' solution to your problem. If the tag regions overlap, the resulting string will be dependent on the order of actions that you want to apply to your string. No way out here in the general case.

    On a more pragmatic note, I doubt it would be wise to try to do it in a single run through the string. All you achieve is headaches figuring out which order of actions is imposed on you by the complicated algorithm you will come up with. Instead, make this order explicit and do several runs through the string. This way, it will at least be explicit what happens.

    Christian Lemburg
    Brainbench MVP for Perl
    http://www.brainbench.com

(tye)Re: Metatag processing (overlapping regions)
by tye (Sage) on Nov 12, 2000 at 01:35 UTC

    Nah, I'd go with your first approach:

        #!/usr/bin/perl -w
        use strict;

        my $str= "aa<bb>aa<U>aa</bb>aa</U>aa";
        $str =~ s#<bb>(.*?)</bb>#
            my $x= $1;
            $x =~ s/a/bb/g;
            $x
        #ge;
        $str =~ s#<U>(.*?)</U>#\U$1#g;
        print "($str)\n"
        __END__
        prints (aabbbbBBBBAAaa)

    If you insist on processing the string once, then I'd build a separate output string so that you don't modify the input string.
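
A minimal sketch of what "build a separate output string" could look like, under some assumptions that are not in the post: the names %transform, @order and render are made up, only the bb and U tags are recognized, and overlapping transformations are applied in a fixed order (bb, then U). The input is read once and never modified.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Per-tag transformations applied to each chunk of text between tags.
# (Illustrative names; not taken from the original post.)
my %transform = (
    bb => sub { (my $t = shift) =~ s/a/bb/g; $t },
    U  => sub { uc shift },
);

# Assumed policy: when several tags are active, apply them in this order.
my @order = qw(bb U);

sub render {
    my ($input) = @_;
    my %active;
    my $out = '';
    for my $piece (split m{(</?(?:bb|U)>)}, $input) {
        if ($piece =~ m{^<(/?)(bb|U)>$}) {
            # A tag only switches state; it contributes nothing to the output.
            if ($1) { delete $active{$2} } else { $active{$2} = 1 }
            next;
        }
        my $text = $piece;
        for my $tag (grep { $active{$_} } @order) {
            $text = $transform{$tag}->($text);
        }
        $out .= $text;
    }
    return $out;
}

print "(", render("aa<bb>aa<U>aa</bb>aa</U>aa"), ")\n";
```

Because each chunk between tags is transformed on its own, a pattern that straddles a tag boundary would not match in this sketch.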

            - tye (but my friends call me "Tye")
      Yeah, so it might. But it isn't an entirely unproblematic approach, either:

      So, we have this processing function for <bb>, like, { s/a/bb/g; }. Cool. But what if I have another metatag called "<a>"? It instantly got more complicated. Now I have to write code in the processor that applies the substitution only to text outside of metatags, which means writing code to detect the tags inside the substitution text.

      Now, suppose I make another metatag "<foo>" that will substitute "foo => bar":

      "<foo>fo<U>o bar</U></foo>" => "baR BAR".

      Now, the replacer wouldn't find "foo" because there was a "<U>" inside it, even though the "<U>" should be zero-length and invisible.

      Maybe that explains my preference for doing it with plaintext and an array of offsets and tag types (the second approach)?
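
One possible variation on the plaintext-plus-offsets idea, sketched here purely as an assumption: instead of storing offsets separately, attach the set of active tags to every character, so the tag information travels with the text when characters are inserted or deleted. The names to_cells, substitute and render are made up, and the inheritance policy (each replacement character takes the tags of the matched character at the same position, clamped to the last one) is just one choice among several.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical representation: one [char, {tag => 1, ...}] cell per character.
sub to_cells {
    my ($tagged) = @_;
    my (%active, @cells);
    for my $piece (split m{(</?\w+>)}, $tagged) {
        if ($piece =~ m{^<(/?)(\w+)>$}) {
            if ($1) { delete $active{$2} } else { $active{$2} = 1 }
            next;
        }
        push @cells, map { [$_, {%active}] } split //, $piece;
    }
    return @cells;
}

# Replace the literal, non-empty string $from with $to wherever every matched
# character carries $tag. Replacement characters inherit the tags of the
# matched character at the same position (assumed policy).
sub substitute {
    my ($tag, $from, $to, @cells) = @_;
    my $n  = length $from;
    my @to = split //, $to;
    my @out;
    my $i = 0;
    while ($i < @cells) {
        my @win = $i + $n <= @cells ? @cells[$i .. $i + $n - 1] : ();
        if (@win
            && !(grep { !$_->[1]{$tag} } @win)
            && join('', map { $_->[0] } @win) eq $from) {
            for my $j (0 .. $#to) {
                my $src = $win[$j < $n ? $j : $n - 1];
                push @out, [$to[$j], $src->[1]];
            }
            $i += $n;
        }
        else {
            push @out, $cells[$i++];
        }
    }
    return @out;
}

# Flatten the cells, uppercasing every character that carries the U tag.
sub render {
    return join '', map { $_->[1]{U} ? uc($_->[0]) : $_->[0] } @_;
}

my @cells = to_cells("<foo>fo<U>o bar</U></foo>");
@cells = substitute('foo', 'foo', 'bar', @cells);
print render(@cells), "\n";
```

With this layout the "<U>" really is zero-length as far as the "foo" matcher is concerned, at the cost of one small structure per character.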

        And what happens if you want to define an escaped region?

        Or say that a \ escapes what would otherwise be a tag?

        My recommendation matches tye's. If the problem is going to stay simple, then KISS: process each tag with its own pass. If it isn't, then create a one-pass algorithm that incrementally builds the replaced string as it passes through the initial string. This approach is more complex to set up, but once set up is a lot more flexible.
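
As a rough illustration of the one-pass idea with room for escapes, here is a sketch of a small hand-rolled lexer using \G and /gc. Everything in it is an assumption made for the example: the names %handler, @order and convert, the rule that a backslash makes the next character literal text, and the fixed order in which active tags are applied.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative per-tag handlers and an assumed application order.
my %handler = (
    bb => sub { (my $t = shift) =~ s/a/bb/g; $t },
    U  => sub { uc shift },
);
my @order = qw(bb U);

sub convert {
    my ($in) = @_;
    my $tags = join '|', map { quotemeta $_ } sort keys %handler;
    my (%active, $text);
    my $out = '';
    pos($in) = 0;
    while (pos($in) < length $in) {
        if ($in =~ m{\G<(/?)($tags)>}gc) {
            # Known tag: toggle its state and emit nothing.
            if ($1) { delete $active{$2} } else { $active{$2} = 1 }
            next;
        }
        elsif ($in =~ m{\G\\(.)}gcs) {
            $text = $1;             # escaped character, taken as plain text
        }
        elsif ($in =~ m{\G([^<\\]+)}gc) {
            $text = $1;             # run of ordinary text
        }
        else {
            $in =~ m{\G(.)}gcs;     # stray '<' or trailing backslash
            $text = $1;
        }
        for my $tag (grep { $active{$_} } @order) {
            $text = $handler{$tag}->($text);
        }
        $out .= $text;              # the input itself is never modified
    }
    return $out;
}

print convert("aa<bb>aa<U>aa</bb>aa</U>aa"), "\n";
print convert('\<U>a<U>a</U>'), "\n";
```

Adding an escaped region or a new tag then means adding one more branch or one more %handler entry, rather than reworking offsets.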

        One approach is to use Parse::RecDescent. There are lots of examples of that around; if you run into trouble, ask questions. Another is to roll your own; for an example of how to do that, take a look at Why I like functional programming. (Skip to the function at the very end. The problem solved there is much more complex than yours here, and most of the code there is irrelevant to you.)

        But trying to do all of this logic with modifying in place is going to be simple insanity to figure out.