in reply to Anchors, bleh :(

MIME uses an idea that is kinda neat and could help here. You pick your placeholder, say "XXX", then search to see if that clashes (because it actually appears in the text already). If it does, look at the character that follows the first occurrence of it in your text, append any character other than that one to your placeholder, and resume searching from that point. Repeat until you reach the end of your text. You will have traversed the text only once, and when you are done you'll have a placeholder that does not appear anywhere in your text. Then you can append your sequence numbers (plus a non-digit terminator) to get your set of conflict-free placeholders.
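The search just described can be sketched roughly as follows. This is a minimal, lightly tested illustration, not anything taken from a MIME library: the name `unique_placeholder` and the choice of 'X' / 'Y' as the appended characters are assumptions made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: lengthen $base until it no longer occurs in $text.
sub unique_placeholder {
    my ( $text, $base ) = @_;
    my $ph  = $base;
    my $pos = 0;
    while ( ( my $at = index( $text, $ph, $pos ) ) >= 0 ) {
        # Character right after this occurrence ('' at the end of the text).
        my $next = substr( $text, $at + length($ph), 1 );
        # Append any character that differs from it, so this spot no longer matches.
        $ph .= ( $next eq 'X' ? 'Y' : 'X' );
        # The longer placeholder starts with the old one, so it cannot
        # occur anywhere earlier; resume from here.
        $pos = $at;
    }
    return $ph;
}

my $text = "aXXXb XXXXc";
my $ph   = unique_placeholder( $text, 'XXX' );
# Sequence number plus ';' as the non-digit terminator.
my @tags = map { "$ph$_;" } 1 .. 3;
```

The terminator matters because without it the tag for block 1 would be a prefix of the tag for block 10.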

But you might have to worry about your manipulations creating a conflict with this placeholder.

Another route would be to "escape" any occurrences of your placeholder both in the original text and in any substitutions that get applied to the text. Then unescape those after you replace the placeholders. For example:

    $text =~ s/%/%%/g;
    # replace first block with "(%1%)"
    # replace second block with "(%2%)"
    # ...
    my %subs= (
        replaceThis => "withThis",
        # ...
    );
    for( @subs{ keys %subs } ) {
        s/%/%%/g;
    }
    $text =~ s/$_/$subs{$_}/g  for  keys %subs;
    # replace (%1%) with original first block
    # ...
    $text =~ s/%%/%/g;

Then you only have to worry about your manipulations accidentally changing a placeholder (which can often be easy to avoid in practice -- which it probably is in your case since you didn't appear worried about it).
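To make the escape route concrete, here is one way the elided "replace block" steps might be filled in. Treat it as a sketch under assumptions: `mangle_protected`, the `<raw>` markers, and the code-ref interface are invented for the example, blocks are numbered from 0 here, and `$raw_re` is assumed to contain no capturing groups of its own.

```perl
use strict;
use warnings;

# Hypothetical wrapper: $raw_re matches the blocks to protect,
# $mangle is a code ref doing the real work on everything else.
sub mangle_protected {
    my ( $text, $raw_re, $mangle ) = @_;

    # After this, every run of '%' has even length, so "(%N%)"
    # cannot occur by accident.
    $text =~ s/%/%%/g;

    # Swap each raw block (already escaped) for a numbered placeholder.
    my @raw;
    $text =~ s{($raw_re)}{ push @raw, $1; '(%' . $#raw . '%)' }ge;

    # The mangler must leave "(%N%)" alone and escape any '%' it adds.
    $text = $mangle->($text);

    # Put the raw blocks back, then undo the escaping everywhere.
    $text =~ s/\(%(\d+)%\)/$raw[$1]/g;
    $text =~ s/%%/%/g;
    return $text;
}

my $result = mangle_protected(
    'a 50% <raw>keep (%0%) me</raw> z',
    qr{<raw>.*?</raw>}s,
    sub { uc shift },
);
```

Note that the raw block in the sample input contains a literal "(%0%)"; because it is escaped to "(%%0%%)" before the placeholders go in, it is not mistaken for one when the blocks are restored.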

> the whole text-mangling routine is big enough that I want to minimize the number of passes (i.e., I don't want to run it on the multiple "interleaved" chunks between the 'raw' bits)

Your concern there appears to be one of speed of execution. You might reconsider this concern (or at least test it), as running the long mangling process several times on short strings could certainly end up not being much slower than running it once on the much longer full string.
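For comparison when testing, running the routine on the interleaved chunks can be written quite compactly with a capturing `split`. This is only a sketch: `mangle_between` and the `<raw>` markers are made-up names, and `$raw_re` is assumed to contain no capturing groups of its own (otherwise the even/odd bookkeeping breaks).

```perl
use strict;
use warnings;

# Hypothetical helper: apply $mangle only to the text between raw blocks.
sub mangle_between {
    my ( $text, $raw_re, $mangle ) = @_;

    # With a capturing group, split also returns the separators, so the
    # list alternates: between-text, raw block, between-text, ...
    # The -1 limit keeps trailing empty fields.
    my @parts = split /($raw_re)/, $text, -1;

    for my $i ( 0 .. $#parts ) {
        # Even indexes are the chunks between raw blocks.
        $parts[$i] = $mangle->( $parts[$i] ) unless $i % 2;
    }
    return join '', @parts;
}

my $result = mangle_between(
    'ab<raw>cd</raw>ef',
    qr{<raw>.*?</raw>}s,
    sub { uc shift },
);
```

Whether this ends up slower than a single pass over the whole string depends on the per-call overhead of the mangling routine, which is exactly the thing worth measuring.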

It is certainly possible to just remove chunks from the string, note the resulting offsets to those spots, and keep running totals of how much these offsets were shifted by each substitution. But that is complex enough that it is quite easy to get it wrong, so I don't think I'd recommend that approach. And I can't think of any alternatives that are better than the above ones.

- tye        

Re^2: Anchors, bleh :( (escape)
by oko1 (Deacon) on Feb 10, 2008 at 15:16 UTC
    > MIME uses an idea that is kinda neat and could help here.
    > [...]
    

    Oh, that's very cute! I can think of a couple of ways of implementing it; in fact, if the data isn't too big, you could say something like this (untested, but the intent should be obvious):

    $delim = '!@#$';
    # Test for both the 'tag' and the '/tag' version;
    # \Q...\E keeps regex metacharacters in $delim literal
    $delim .= chr(rand(94) + 33) while $data =~ m{/?\Q$delim\E}s;

    > Your concern there appears to be one of speed of 
    > execution. You might reconsider this concern (or at 
    > least test it), as running the long mangling process 
    > several times on short strings could certainly end up 
    > not being much slower than running it once on the much 
    > longer full string.
    

    [blink] Yes. The latter was exactly what I was saying, so we're in violent agreement. :) That's why I'm trying to come up with code that will let me do that.

    I actually ran a test even before posting here - I'm quite aware that "efficiency is the hobgoblin of little minds". The speed of execution, for what was admittedly a rather large mail archive, went from about two seconds (using a script without this routine) to several minutes (I killed it after about two and a half; no, it wasn't stuck in a loop.) Tagging, removing, processing (once), and replacing - less than three seconds.

    Perhaps the way that I had written the new routine was at fault - I'm not sure - but I am sure that running my "mangling" routine more than once is quite expensive.