Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I've got a script which essentially restricts HTML within a post to certain tags (much like many sites), and automatically adds 'BR' tags to new lines. One of the tags I'd like to allow is the 'PRE' tag, which should not have a break tag to break lines.

I use the regex 's/\n/<BR>\n/gs' to add the break tags to the whole comment, then I'd like to remove the tags in PRE by using something like:

while(/<pre>(.*?)<\/pre>/) { $1 =~ s/<BR>\n/\n/gs; }

Naturally this doesn't work, but how does one perform a substitute on a match in such a situation?

This is just one example; another time I've come across this problem was when using Perl to highlight code syntax on a DOS box with ANSI escape sequences. I would have liked to be able to remove the escape sequences that appear within comments or quotes, so something like (off the top of my head):

while(/"(.*?)"/g) { $1 =~ s/\c@\[\d+[a-x]//g; }

Thanks!

Re: Substitute within a search
by davorg (Chancellor) on Aug 26, 2004 at 11:06 UTC

    Trying to parse HTML with regular expressions is generally a bad idea. It'll seem to work for a while until someone sends you some valid HTML that you haven't thought about.
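    As a small illustration of the kind of input that slips past a regex (the sample string is made up for this sketch), the pattern from the question silently skips a PRE block as soon as the tag carries an attribute:

```perl
use strict;
use warnings;

# Hypothetical sample input: the second block carries an attribute.
my $html = qq{<pre>one<br>\ntwo</pre>\n<pre class="code">three<br>\nfour</pre>};

# The pattern from the question only recognises a bare, lowercase <pre>.
my @found = $html =~ m{<pre>(.*?)</pre>}gs;

# Only the first block is found; the <pre class="code"> block is missed.
print scalar(@found), " block(s) matched\n";
```

    Both blocks are valid HTML, but only one is seen, so the second would keep its unwanted break tags.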

    The best way is to use an HTML parser such as HTML::Parser.

    --
    <http://www.dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

Re: Substitute within a search
by tachyon (Chancellor) on Aug 26, 2004 at 11:11 UTC

    You were actually on exactly the right track with your first regex. The simple trick is to assign $1 to another var, modify that var, and then use the modified value as the substitution, like this:

    $html =~ s/\n/<br>\n/g;
    # assume this, and then fix <br> in pre blocks with.....
    $html =~ s{<pre>(.*?)</pre>} { local $_=$1; s/<br>//g; $_ }gse;
    print $html;

    We need the /g of course, the /s to make . match \n if present and the /e to exec the code in the replace block. The last thing evaluated in the replace block is what is used.
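    The same copy-modify-return pattern should carry over to the second example in the question. A rough sketch, assuming the escape sequences look like ESC, '[', optional digits and semicolons, then a letter (that format and the sample text are my assumptions, not taken from the original script):

```perl
use strict;
use warnings;

# Hypothetical input: colour codes of the form ESC [ 31 m, some inside quotes.
my $text = qq{\e[32mprint\e[0m "\e[31mhello\e[0m world"};

# For each quoted string: copy the capture, clean the copy, hand it back.
# The quotes sit outside the parens, so the replacement adds them again.
$text =~ s{"(.*?)"}{
    my $inner = $1;
    $inner =~ s/\e\[[\d;]*[A-Za-z]//g;   # assumed escape sequence format
    qq{"$inner"};
}gse;

print $text, "\n";
```

    The escape codes outside the quotes are left alone; only the quoted part is cleaned.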

    cheers

    tachyon

      Thanks, that's exactly what I was looking for! With a small modification, of course: it needs either the parens removed from the first regex and $1 replaced with $&, or the parens moved to surround the whole expression. That caught me out initially, as the 'pre' tags weren't coming out ;)
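      For anyone finding this later, here is roughly what the second variant (parens around the whole expression) looks like. The sample string is made up, and I've used a lexical copy instead of local $_, which is just a matter of taste:

```perl
use strict;
use warnings;

my $html = "intro\n<pre>line 1\nline 2\n</pre>\noutro\n";   # hypothetical sample

$html =~ s/\n/<br>\n/g;   # add breaks everywhere first

# The parens now wrap the whole block, tags included, so $1 carries the tags too
# and they survive the replacement.
$html =~ s{(<pre>.*?</pre>)}{
    my $block = $1;
    $block =~ s/<br>//g;
    $block;
}gse;

print $html;
```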
Re: Substitute within a search
by ccn (Vicar) on Aug 26, 2004 at 10:55 UTC

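    $str = "<pre><br>\n</pre>";
    while ($str =~ /(<pre>.*?<\/pre>)/gsi) {
        my $len   = length $1;
        my $start = pos($str) - $len;          # where the matched block begins
        (my $block = $1) =~ s/<BR>\n/\n/gi;    # clean a copy of the block
        substr($str, $start, $len) = $block;   # splice it back in
        pos($str) = $start + length $block;    # resume right after the shortened block
    }
    print $str;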
Re: Substitute within a search
by bart (Canon) on Aug 26, 2004 at 17:04 UTC
    My favourite solution to problems like these is to match the sections of text you want to skip and replace them with themselves, so that the real substitution only happens in all the other sections. Something like this:
    s/(<pre>.*?<\/pre>)|\n/$1 || "<br>\n"/isge;
    This way you'll only replace newlines that aren't in <PRE> sections.
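    A self-contained sketch of that one-liner, with a made-up sample string. It uses defined $1 instead of $1 || ..., a small variation that tests explicitly whether the PRE branch matched:

```perl
use strict;
use warnings;

my $html = "intro\n<PRE>line 1\nline 2\n</PRE>\noutro\n";   # hypothetical sample

# Either branch can match: a whole <pre> block (captured in $1) or a lone newline.
# A matched block is put back unchanged; a lone newline becomes "<br>\n".
$html =~ s{(<pre>.*?</pre>)|\n}{ defined $1 ? $1 : "<br>\n" }isge;

print $html;
```

    Because the PRE block is consumed as a single match, the newlines inside it are never visited, so there is nothing to undo afterwards.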