Re: Anchors, bleh :(

You don't say how these sections of text are demarked, but assuming the matching requirements for the start and end of the demarked sections are reasonable, you can do what you need much more simply. E.g., the text of your post contains five sets of parens. If you treat those as the untouchable text and everything else as the text to be mangled, then this achieves that without any messing about with placeholders. There are no restrictions on what mangling you do within the replacement side of the s///, including safely using other regexes or calling a subroutine:

#! perl -slw
use strict;

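# Slurp the whole of the DATA section into one string.
my $data = do{ local $/; <DATA> };
my $start = '(';    # opens an untouchable section
my $stop = ')';     # closes an untouchable section

# Match each run of text outside the delimiters: it begins at the start of
# the string or at a stop marker, and ends just before the next start
# marker or at the end of the string.
$data =~ s[((?:^|\Q$stop\E).+?(?=\Q$start\E|$))]{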
    my $toModify = $1;
    $toModify = uc $toModify;
    $toModify;
}seg;

print $data;

__DATA__
The text from your OP goes here.

Produces

C:\test>666970
HI, ALL:

I'VE GOT A RATHER LONG SCRIPT - IT'S ONE OF THOSE THAT GREW BY
AGGLOMERATION AND, WELL, IT'LL GET REWRITTEN SOMEDAY. REALLY.
[WRY LOOK].

ANYWAY... IT DOES A HUGE AMOUNT OF TEXT MANGLING - ESSENTIALLY
PROCESSING EMAILS AND SETTING UP THE CONTENT TO BE DISPLAYED ON
THE WEB - AND WORKS WELL, BUT THERE'S BEEN ONE THING THAT I'VE
WANTED IT TO DO FOR A LONG WHILE NOW, AND JUST GOT AROUND TO
IMPLEMENTING: I WANT IT TO LEAVE SPECIFIC, DEMARCATED CHUNKS OF
TEXT ALONE, NO PROCESSING TO BE DONE AT ALL.

WHAT I'VE DONE IS TO FIND THESE CHUNKS, EXTRACT THEM, AND PUSH
THEM ONTO AN ARRAY, THEN REPLACE THEM WITH NUMBERED ANCHORS
(e.g., "XXX_REINSERT{12}_XXX" - '12' is the index within that array).
I THEN DO THE PROCESSING, AND - OBVIOUSLY - REPLACE THE ANCHORS WITH THE
"HELD BACK" BITS.

THE CODE IS REASONABLY OBVIOUS - ALTHOUGH I ENDED UP USING A BUNCH OF
"SUBSTR"S INSTEAD OF 'S///' FOR SEVERAL REASONS - AND I DON'T THINK
IT'S WORTH POSTING HERE (unless someone wants to see it) -
BECAUSE MY QUESTION IS OF A MORE GENERAL NATURE. HERE IT IS:

GIVEN THAT THE LENGTH OF THE OVERALL STRING (the email body) IS GOING
TO BE CHANGED ARBITRARILY, AND THAT THE WHOLE TEXT-MANGLING ROUTINE
IS BIG ENOUGH THAT I WANT TO MINIMIZE THE NUMBER OF PASSES
(i.e., I don't want to run it on the multiple "interleaved" chunks
between the 'raw' bits), IS THERE A BETTER PROGRAMMATIC APPROACH
THAN ANCHORS OF THIS SORT? THIS APPROACH SEEMS RATHER CRUDE, AND HAS
AN OBVIOUS, ALTHOUGH RATHER EASILY AVOIDABLE FAILURE MODE
(what if there's a line in the text that actually says
'XXX_REINSERT_"-whatever?), AND I'D LIKE TO SEE IF MY FELLOW MONKS
HAVE SOME WISDOM TO SHARE ON THIS ISSUE.

THANKS IN ADVANCE!
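For anyone who wants to try the idea without pasting a whole post into __DATA__, here is a minimal self-contained sketch of the same substitution with the mangling moved into a subroutine. The names mangle() and mangle_outside() are made up for illustration; uc is only a stand-in for whatever processing is really needed.

```perl
use strict;
use warnings;

# Hypothetical mangling routine; stands in for any real processing.
sub mangle {
    my ($text) = @_;
    return uc $text;
}

# Apply mangle() to every run of text outside the start/stop delimiters.
sub mangle_outside {
    my ($data, $start, $stop) = @_;
    $data =~ s[((?:^|\Q$stop\E).+?(?=\Q$start\E|$))]{ mangle($1) }seg;
    return $data;
}

print mangle_outside("keep (this bit) and (that bit) as is\n", '(', ')');
```

One caveat with the pattern as written: because `.+?` needs at least one character, a protected section at the very start of the string, or two protected sections with nothing between them, will not be skipped correctly. That is fine for text like the post above, but worth checking against real input.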


Re^2: Anchors, bleh :( (regex alts) by tye (Sage) on Feb 10, 2008 at 18:35 UTC
Good idea. It reminds me of a similar idea that is at the heart of many parsers. You make your regex of the form `(things to be left alone)|(things to mangle)`, for example:

s{
    ( [(] [^()]* [)] )  # Leave text in parens alone
|   <(\w+)>             # Replace <word> with $replace{word}
}{
    $1 ? $1 : $replace{$2}
}gex;

- tye
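A runnable sketch of that alternation idea, for experimenting. The %replace table, its contents, and the expand() wrapper are assumptions made up for this example; only the shape of the regex comes from the reply above.

```perl
use strict;
use warnings;

# Hypothetical lookup table for the <word> substitutions.
my %replace = ( name => 'World', greet => 'Hello' );

sub expand {
    my ($text) = @_;
    $text =~ s{
        ( [(] [^()]* [)] )   # first group: text in parens, left alone
      |
        <(\w+)>              # second group: a word in angle brackets
    }{
        # Put protected text back unchanged, otherwise look the word up.
        defined $1 ? $1 : $replace{$2}
    }gex;
    return $text;
}

print expand("<greet>, <name>! (but not <name> here)\n");
```

Because the "leave alone" alternative comes first and consumes the whole parenthesised chunk, the `<name>` inside the parens is never offered to the second alternative. This sketch assumes every `<word>` it meets has an entry in the table; an unknown word would produce an undefined-value warning.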
Re^2: Anchors, bleh :( by oko1 (Deacon) on Feb 12, 2008 at 15:45 UTC
> You don't say how these sections of text are demarked, but
> assuming the matching requirements for the start and end of
> the demarked sections are reasonable, you can do what you
> need much more simply.

[applause] Brilliant, and exactly what I was asking for; unfortunately, it showed me that I was asking for the wrong thing. :\ Not horribly wrong, just a little off - but there's too much interaction between the results of this routine and the rest of the 'cleanup' routine to separate the two like this. What ends up happening when I try to use it this way is that the sub sees each chunk of data as a stand-alone piece and wraps paragraph markers around it - which results in the 'raw' piece always being a separate paragraph.

So, it seems that this bit of processing has to remain part of the 'cleanup' routine itself - that is, I need to extract those pieces, mangle the main body, stick those pieces back in, and then finish paragraphing everything. Darn it.

Well, the intent was good, anyway - and thanks to all you wizards, I've learned something. Thank you!

Incidentally, I had to modify your regex a bit so it would get rid of the tags themselves:

my $raw_start = '[RAW]';
my $raw_stop = '[/RAW]';
$body =~ s[(?:^|\Q$raw_stop\E)(.+?)(?:\Q$raw_start\E|$)]{
    my $toModify = $1;
    $toModify = cleanup($toModify);
    $toModify;
}seg;
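To make the paragraphing problem concrete, here is a small sketch using the tag-stripping regex from this reply. The cleanup() below is a made-up stand-in that only wraps its input in paragraph markers (the real routine is not shown in the thread), and process() is a hypothetical wrapper added so the example is self-contained.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real cleanup routine: it simply wraps
# whatever it is given in paragraph markers.
sub cleanup {
    my ($text) = @_;
    return "<p>$text</p>";
}

my $raw_start = '[RAW]';
my $raw_stop  = '[/RAW]';

# Run cleanup() on each chunk outside the tags and drop the tags themselves.
sub process {
    my ($body) = @_;
    $body =~ s[(?:^|\Q$raw_stop\E)(.+?)(?:\Q$raw_start\E|$)]{ cleanup($1) }seg;
    return $body;
}

print process('Some text [RAW]<b>raw</b>[/RAW] and more text'), "\n";
# prints: <p>Some text </p><b>raw</b><p> and more text</p>
```

Each chunk between raw sections is cleaned up in isolation, so what was one sentence ends up as two paragraphs with the raw piece stranded between them, which matches the behaviour described above.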