oko1 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, all:

I've got a rather long script - it's one of those that grew by agglomeration and, well, it'll get rewritten someday. Really. [wry look].

Anyway... it does a huge amount of text mangling - essentially processing emails and setting up the content to be displayed on the Web - and works well, but there's been one thing that I've wanted it to do for a long while now, and just got around to implementing: I want it to leave specific, demarcated chunks of text alone, no processing to be done at all.

What I've done is to find these chunks, extract them, and push them onto an array, then replace them with numbered anchors (e.g., "XXX_REINSERT{12}_XXX" - '12' is the index within that array). I then do the processing, and - obviously - replace the anchors with the "held back" bits.
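
To make that concrete, here's a stripped-down sketch of the shape of it (untested; the [RAW]...[/RAW] tags, the exact marker format, and the uc() stand-in for the real mangling are just illustrations, not my actual code):

#!/usr/bin/perl
use strict;
use warnings;

my $body = "keep [RAW]this bit verbatim[/RAW] but mangle the rest\n";

# Pull the demarcated chunks out and drop numbered markers in their place.
my @held;
$body =~ s{\[RAW\](.*?)\[/RAW\]}{
    push @held, $1;
    'XXX_REINSERT{' . $#held . '}_XXX';
}seg;

# One pass of "mangling" over everything that's left.
$body = uc $body;

# Put the untouched chunks back where their markers ended up.
$body =~ s{XXX_REINSERT\{(\d+)\}_XXX}{$held[$1]}g;

print $body;   # prints: KEEP this bit verbatim BUT MANGLE THE REST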

The code is reasonably obvious - although I ended up using a bunch of "substr"s instead of 's///' for several reasons - and I don't think it's worth posting here (unless someone wants to see it) - because my question is of a more general nature. Here it is:

Given that the length of the overall string (the email body) is going to be changed arbitrarily, and that the whole text-mangling routine is big enough that I want to minimize the number of passes (i.e., I don't want to run it on the multiple "interleaved" chunks between the 'raw' bits), is there a better programmatic approach than anchors of this sort? This approach seems rather crude, and has an obvious, although rather easily avoidable failure mode (what if there's a line in the text that actually says 'XXX_REINSERT_"-whatever?), and I'd like to see if my fellow Monks have some wisdom to share on this issue.

Thanks in advance!

Re: Anchors, bleh :(
by Fletch (Bishop) on Feb 08, 2008 at 14:29 UTC

    Sounds like you probably could rip it out and replace it with something like Template Toolkit or the like. Just replace your placeholders (I'd call them that before "anchors", which usually brings to mind ^ and \z and related regexen friends in a Perl context) with [% REINSERT.12 %] and populate the template variables accordingly ($template_vars->{'REINSERT'} = \@reinsert_array;). Then you get to let someone else worry about the efficiency bits.
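
    Something along these lines, say (a rough, untested sketch; the variable names and the sample body are just placeholders):

    use strict;
    use warnings;
    use Template;

    # Pretend the mangled body already contains TT-style placeholders.
    my $body     = "MANGLED TEXT BEFORE [% REINSERT.0 %] AND AFTER [% REINSERT.1 %]\n";
    my @reinsert = ( 'first held-back chunk', 'second held-back chunk' );

    my $tt = Template->new;
    my $output;
    $tt->process( \$body, { REINSERT => \@reinsert }, \$output )
        or die $tt->error;

    print $output;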

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

      [laugh] I'm not surprised to discover that someone's already automated the thing that I'm doing manually; that's My Life with Perl (although I do tend to check CPAN when things get complex; this wasn't complex, just... intellectually itchy.) I'm just wondering if this kind of approach is the only answer to this kind of problem.
Re: Anchors, bleh :(
by roboticus (Chancellor) on Feb 08, 2008 at 14:27 UTC
    oko1:

    Don't rewrite your script someday. Just choose a chunk and refactor it each time you modify it, so it can evolve into a better script. Choose either the chunk that offends you most, or the chunk you're working on to add the feature.

    That way, if the script doesn't need many modifications, it never gets perfect--but you won't care since you're not in the code. If you have to make frequent modifications, it will quickly get into shape.

    And here's a simple way to eliminate that failure mode you mention: Treat those 'fake' anchors as some of the specific, demarcated chunks of text you want to preserve. That way, you'll simply replace the thing that looks like a marker with a real marker. Then when you plug the original bits back in, you're golden.
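
    In code, that might look something like this (untested; the [RAW] tags and the marker format are stand-ins for whatever you're actually using, and stripping the tags on reinsertion is left out for brevity):

    use strict;
    use warnings;

    my $body = "real chunk: [RAW]keep me[/RAW]\n"
             . "impostor:   XXX_REINSERT{7}_XXX happened to be in the mail\n";

    # Hold back the genuine chunks *and* anything that merely looks like a
    # marker, so a stray lookalike can never be mistaken for one of ours.
    my @held;
    $body =~ s{( \[RAW\] .*? \[/RAW\] | XXX_REINSERT\{\d+\}_XXX )}{
        push @held, $1;
        'XXX_REINSERT{' . $#held . '}_XXX';
    }sxeg;

    $body = uc $body;                                   # stand-in mangling pass
    $body =~ s{XXX_REINSERT\{(\d+)\}_XXX}{$held[$1]}g;  # put everything back
    print $body;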

    ...roboticus

      That's actually what I've been doing - and, in fact, the individual chunks of the script (the various routines) are pretty darned good. What needs rewriting is the overall structure: it's *really* not well-defined, and would be hell for someone else to read despite the copious comments.

      > And here's a simple way to eliminate that failure mode 
      > you mention: Treat those 'fake' anchors as some of the 
      > specific, demarcated chunks of text you want to 
      > preserve. That way, you'll simply replace the thing 
      > that looks like a marker with a real marker. Then when 
      > you plug the original bits back in, you're golden.
      

      Well, as I've said - it's easily avoidable. I just think that there should be a more... graceful way to do this - sort of like using a Schwartzian transform instead of plodding through a bunch of loops and fiddling with temp vars.

Re: Anchors, bleh :( (escape)
by tye (Sage) on Feb 09, 2008 at 08:01 UTC

    MIME uses an idea that is kinda neat and could help here. You pick your placeholder, say "XXX", then search to see if that clashes (because it actually appears in the text already). If it does, then you look at the character after the first occurrence of it in your text, append any other character to your placeholder, and search from that point again. Repeat until you reach the end of your text. You will have only traversed the text one time, and when you are done you'll have a placeholder that does not appear anywhere in your text. Then you can append your sequence numbers (plus a non-digit terminator) to get your set of conflict-free placeholders.
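
    A rough sketch of that search (my own untested interpretation of the scheme, not code lifted from any module):

    use strict;
    use warnings;

    sub unique_placeholder {
        my ($text, $placeholder) = @_;
        my $pos = 0;
        my $hit;
        while ( ($hit = index $text, $placeholder, $pos) >= 0 ) {
            # The placeholder clashes here: append a character that differs
            # from the one following the clash, then resume from this spot.
            my $next = substr $text, $hit + length $placeholder, 1;
            $placeholder .= $next eq 'X' ? 'Y' : 'X';
            $pos = $hit;
        }
        return $placeholder;   # now guaranteed not to occur in $text
    }

    my $body = "mail text containing XXX and even XXXXXX already\n";
    my $ph   = unique_placeholder( $body, 'XXX' );
    print "conflict-free placeholder: $ph\n";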

    But you might have to worry about your manipulations creating a conflict with this placeholder.

    Another route would be to "escape" any occurrences of your placeholder, both in the original text and in any substitutions that get applied to the text. Then unescape those after you replace the placeholders. For example:

    $text =~ s/%/%%/g;
    # replace first block with "(%1%)"
    # replace second block with "(%2%)"
    # ...
    my %subs= (
        replaceThis => "withThis",
        # ...
    );
    for( @subs{ keys %subs } ) {
        s/%/%%/g;
    }
    $text =~ s/$_/$subs{$_}/g  for keys %subs;
    # replace (%1%) with original first block
    # ...
    $text =~ s/%%/%/g;

    Then you only have to worry about your manipulations accidentally changing a placeholder (which can often be easy to avoid in practice -- which it probably is in your case since you didn't appear worried about it).

    > the whole text-mangling routine is big enough that I want to minimize
    > the number of passes (i.e., I don't want to run it on the multiple
    > "interleaved" chunks between the 'raw' bits)

    Your concern there appears to be one of speed of execution. You might reconsider this concern (or at least test it), as running the long mangling process several times on short strings could certainly end up not being much slower than running it once on the much longer full string.
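
    If you want to measure rather than guess, a quick-and-dirty comparison along these lines would do (the mangle() and the fake data here are trivial stand-ins for your real routine and your real mail bodies):

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my $big    = "a not very interesting line of email body text\n" x 5_000;
    my @chunks = unpack '(a500)*', $big;    # fake "interleaved" chunks

    sub mangle {
        my $t = shift;
        $t =~ s/\bnot\b/NOT/g;    # trivial stand-in for the real mangling
        return $t;
    }

    cmpthese( -2, {
        one_big_pass => sub { my $out = mangle($big) },
        many_chunks  => sub { my $out = join q{}, map { mangle($_) } @chunks },
    } );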

    It is certainly possible to just remove chunks from the string, note the resulting offsets to those spots, and keep running totals of how much these offsets were shifted by each substitution. But that is complex enough that it is quite easy to get it wrong, so I don't think I'd recommend that approach. And I can't think of any alternatives that are better than the above ones.

    - tye        

      > MIME uses an idea that is kinda neat and could help here.
      > [...]
      

      Oh, that's very cute! I can think of a couple of ways of implementing it; in fact, if the data isn't too big, you could say something like this (untested, but the intent should be obvious):

      $delim = '!@#$';
      # Test for both the 'tag' and the '/tag' version; \Q...\E keeps the
      # punctuation in $delim from being treated as regex metacharacters.
      $delim .= chr(rand(95) + 33) while $data =~ m{/?\Q$delim\E}s;
      > Your concern there appears to be one of speed of 
      > execution. You might reconsider this concern (or at 
      > least test it), as running the long mangling process 
      > several times on short strings could certainly end up 
      > not being much slower than running it once on the much 
      > longer full string.
      

      [blink] Yes. The latter was exactly what I was saying, so we're in violent agreement. :) That's why I'm trying to come up with code that will let me do that.

      I actually ran a test even before posting here - I'm quite aware that "efficiency is the hobgoblin of little minds". The speed of execution, for what was admittedly a rather large mail archive, went from about two seconds (using a script without this routine) to several minutes (I killed it after about two and a half; no, it wasn't stuck in a loop.) Tagging, removing, processing (once), and replacing - less than three seconds.

      Perhaps the way that I had written the new routine was at fault - I'm not sure - but I am sure that running my "mangling" routine more than once is quite expensive.

Re: Anchors, bleh :(
by BrowserUk (Patriarch) on Feb 10, 2008 at 16:59 UTC

    You don't say how these sections of text are demarked, but assuming the matching requirements for the start and end of the demarked sections are reasonable, you can do what you need much more simply. E.g., the text of your post contains five sets of parens. If you consider those to be the untouchable text, and the rest is to be mangled, then this achieves that without all the messing about with placeholders. There are no restrictions on what mangling you do within the replace side of the s///, including using other regexes safely, or calling a subroutine:

    #! perl -slw
    use strict;

    my $data = do{ local $/; <DATA> };

    my $start = '(';
    my $stop  = ')';

    $data =~ s[((?:^|\Q$stop\E).+?(?=\Q$start\E|$))]{
        my $toModify = $1;
        $toModify = uc $toModify;
        $toModify;
    }seg;

    print $data;

    __DATA__
    The text from your OP goes here.

    Produces

    C:\test>666970
    HI, ALL:

    I'VE GOT A RATHER LONG SCRIPT - IT'S ONE OF THOSE THAT GREW BY
    AGGLOMERATION AND, WELL, IT'LL GET REWRITTEN SOMEDAY. REALLY. [WRY LOOK].

    ANYWAY... IT DOES A HUGE AMOUNT OF TEXT MANGLING - ESSENTIALLY PROCESSING
    EMAILS AND SETTING UP THE CONTENT TO BE DISPLAYED ON THE WEB - AND WORKS
    WELL, BUT THERE'S BEEN ONE THING THAT I'VE WANTED IT TO DO FOR A LONG
    WHILE NOW, AND JUST GOT AROUND TO IMPLEMENTING: I WANT IT TO LEAVE
    SPECIFIC, DEMARCATED CHUNKS OF TEXT ALONE, NO PROCESSING TO BE DONE AT
    ALL.

    WHAT I'VE DONE IS TO FIND THESE CHUNKS, EXTRACT THEM, AND PUSH THEM ONTO
    AN ARRAY, THEN REPLACE THEM WITH NUMBERED ANCHORS (e.g.,
    "XXX_REINSERT{12}_XXX" - '12' is the index within that array). I THEN DO
    THE PROCESSING, AND - OBVIOUSLY - REPLACE THE ANCHORS WITH THE "HELD
    BACK" BITS.

    THE CODE IS REASONABLY OBVIOUS - ALTHOUGH I ENDED UP USING A BUNCH OF
    "SUBSTR"S INSTEAD OF 'S///' FOR SEVERAL REASONS - AND I DON'T THINK IT'S
    WORTH POSTING HERE (unless someone wants to see it) - BECAUSE MY QUESTION
    IS OF A MORE GENERAL NATURE. HERE IT IS:

    GIVEN THAT THE LENGTH OF THE OVERALL STRING (the email body) IS GOING TO
    BE CHANGED ARBITRARILY, AND THAT THE WHOLE TEXT-MANGLING ROUTINE IS BIG
    ENOUGH THAT I WANT TO MINIMIZE THE NUMBER OF PASSES (i.e., I don't want
    to run it on the multiple "interleaved" chunks between the 'raw' bits),
    IS THERE A BETTER PROGRAMMATIC APPROACH THAN ANCHORS OF THIS SORT? THIS
    APPROACH SEEMS RATHER CRUDE, AND HAS AN OBVIOUS, ALTHOUGH RATHER EASILY
    AVOIDABLE FAILURE MODE (what if there's a line in the text that actually
    says 'XXX_REINSERT_"-whatever?), AND I'D LIKE TO SEE IF MY FELLOW MONKS
    HAVE SOME WISDOM TO SHARE ON THIS ISSUE.

    THANKS IN ADVANCE!

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Good idea. It reminds me of a similar idea that is at the heart of many parsers. You make your regex of the form (things to be left alone)|(things to mangle), for example:

      s{
          ( [(] [^()]* [)] )    # Leave text in parens alone
        |
          <(\w+)>               # Replace <word> with $replace{word}
      }{
          $1 ? $1 : $replace{$2}
      }gex;

      - tye        

      > You don't say how these sections of text are demarked, but 
      > assuming the matching requirements for the start and end of 
      > the demarked sections are reasonable, you can do what you 
      > need much more simply.
      

      [applause] Brilliant, and exactly what I was asking for; unfortunately, it showed me that I was asking for the wrong thing. :\ Not horribly wrong, just a little off - but there's too much interaction between the results of this routine and the rest of the 'cleanup' routine to separate the two like this. What ends up happening when I try to use it this way is that the sub sees each chunk of data as a stand-alone piece and wraps paragraph markers around it - which results in the 'raw' piece always being a separate paragraph. So, it seems that this bit of processing has to remain part of the 'cleanup' routine itself - that is, I need to extract those pieces, mangle the main body, stick those pieces back in, and then finish paragraphing everything.

      Darn it. Well, the intent was good, anyway - and thanks to all you wizards, I've learned something. Thank you!

      Incidentally, I had to modify your regex a bit so it would get rid of the tags themselves:

      my $raw_start = '[RAW]';
      my $raw_stop  = '[/RAW]';

      $body =~ s[(?:^|\Q$raw_stop\E)(.+?)(?:\Q$raw_start\E|$)]{
          my $toModify = $1;
          $toModify = cleanup($toModify);
          $toModify;
      }seg;
Re: Anchors, bleh :(
by hexcoder (Curate) on Feb 08, 2008 at 19:14 UTC
    Hi,

    If the text body can be split flatly (that is, not nested) into parts which need mangling and parts to be left alone, it would be easy. Then you could loop through the text like this (but maybe you should not use the patterns from below :-)

    use strict;
    use warnings;

    my $text    = join q{}, <DATA>;
    my $newtext = q{};

    while ($text =~ m{
                \G           # start where we left off
                (.*?)        # text to be mangled
                (?:          # followed by
                    STARTPATTERN    # the start of verbatim mark
                    (.*?)           # the text to be taken verbatim
                    ENDPATTERN      # the end of verbatim mark
                )?           # need not always be present
            }xmsg) {

        # add the mangled version of the text
        $newtext .= mangle($1);

        if (defined $2) {
            # if there is a verbatim part, copy it verbatim
            $newtext .= $2;
        }
    }

    print $newtext;
    exit 0;

    sub mangle {
        return uc $_[0];
    }

    __DATA__
    Once upon STARTPATTERNa timeENDPATTERN there was a Perl monk.
    STARTPATTERNBy steadily trainingENDPATTERN the great virtues he became very wise...
    gives
    ONCE UPON a time THERE WAS A PERL MONK.
    By steadily training THE GREAT VIRTUES HE BECAME VERY WISE...
    Hope this helps

      Thank you - you've just given me the last bit I'd been missing in understanding '\G' (I knew about it, but had failed to produce useful code when I'd tried it before.) Much appreciated!

      Unfortunately, with regard to this script, your implementation hits one of the restrictions that I stated: the "mangling" is so extensive that breaking up the text into the 'raw/cooked' bits and then processing each cooked bit (as opposed to the single chunk sans the raw bits) is too expensive, time-wise.

      I'm slowly coming to believe that, given my particular requirements, the 'tag-and-reinsert' method may well be the best one. This is reinforced by the fact that there are modules out there that do the same thing (the [% foo.x %] syntax mentioned previously, etc.) Oh well...

      Again, thank you for your response.