in reply to regex in form !regex->regex<-!regex

As ELISHEVA rightly points out, this really is a job for an HTML parser. Doing this with a regex is generally not worth the effort it takes to get the result. Probably the two most challenging aspects are the possibility of nested tags and the lack of support for variable-width look-behinds (see Looking ahead and looking behind). You could get something like your desired behavior with:

#!/usr/bin/perl
use strict;
use warnings;

my $text = <<EOT;
<p>This is a line
with a break.</p><pre>This is a pre
with a break.</pre><p>This is a line
with a break.</p>
EOT

1 while $text =~ s{^((?:(?!<pre>).|<pre>(?:(?!</pre>).)*</pre>)*?)\n}{$1<br/>}is;

print $text;

which outputs

<p>This is a line<br/>with a break.</p><pre>This is a pre
with a break.</pre><p>This is a line<br/>with a break.</p><br/>

YAPE::Regex::Explain breaks this down as
The regular expression:

(?is-mx:^((?:(?!<pre>).|<pre>(?:(?!</pre>).)*</pre>)*?)\n)

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?is-mx:                 group, but do not capture (case-insensitive)
                         (with . matching \n) (with ^ and $ matching
                         normally) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
        <pre>                    '<pre>'
----------------------------------------------------------------------
      )                        end of look-ahead
----------------------------------------------------------------------
      .                        any character
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      <pre>                    '<pre>'
----------------------------------------------------------------------
      (?:                      group, but do not capture (0 or more
                               times (matching the most amount
                               possible)):
----------------------------------------------------------------------
        (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
          </pre>                   '</pre>'
----------------------------------------------------------------------
        )                        end of look-ahead
----------------------------------------------------------------------
        .                        any character
----------------------------------------------------------------------
      )*                       end of grouping
----------------------------------------------------------------------
      </pre>                   '</pre>'
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  \n                       '\n' (newline)
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------

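The `^` node in the breakdown is the reason the substitution is wrapped in `1 while` instead of carrying a `g` modifier. Here is a small sketch comparing the two forms. The sample input and the variable names are my own, not from the original question, and it uses only core Perl:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same pattern as above, compiled once so both attempts share it.
my $re = qr{^((?:(?!<pre>).|<pre>(?:(?!</pre>).)*</pre>)*?)\n}is;

# Hypothetical sample input (an assumption for illustration only).
my $input = "a\nb<pre>c\nd</pre>e\nf\n";

# Attempt 1: a single pass with /g. Without /m, ^ matches only at the
# start of the string, so after the first match there is nowhere left
# for the pattern to anchor.
my $with_g  = $input;
my $g_count = ( $with_g =~ s{$re}{$1<br/>}g ) || 0;

# Attempt 2: rerun the substitution from the start until it fails.
my $looped     = $input;
my $loop_count = 0;
$loop_count++ while $looped =~ s{$re}{$1<br/>};

print "g modifier: $g_count substitution(s)\n$with_g\n";
print "loop:       $loop_count substitution(s)\n$looped\n";
```

With this input the `g` version should report one substitution and the loop three; the newline inside the `<pre>` block is left alone in both.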
Note that you have to rerun the regex (as opposed to using the g modifier) because every match has to be anchored at the start of the string. Also note the trailing <br/>. It hints at a larger problem: are you absolutely certain you want to change all newlines in your input? They tend to show up in strange locations. It's all these corner cases that make a pre-built library so worthwhile. HTML::Parser has been tested and debugged for 15 years, not the 15 minutes one would like to spend.
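For comparison, here is a rough sketch of the parser route. It assumes HTML::Parser (a CPAN module, not core) is installed and uses its version 3 handler interface. The function name `newlines_to_br` and the depth counter are my own illustration, not something from the original question, and I have only checked it against simple input like that shown.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: replace newlines with <br/> everywhere except
# inside <pre> elements.
sub newlines_to_br {
    my ($html) = @_;
    my $out   = '';
    my $depth = 0;    # number of <pre> elements currently open

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [
            sub {
                my ( $tag, $text ) = @_;
                $depth++ if $tag eq 'pre';
                $out .= $text;    # pass the tag through unchanged
            },
            'tagname, text'
        ],
        end_h => [
            sub {
                my ( $tag, $text ) = @_;
                $depth-- if $tag eq 'pre' && $depth > 0;
                $out .= $text;
            },
            'tagname, text'
        ],
        text_h => [
            sub {
                my ($text) = @_;
                $text =~ s{\n}{<br/>}g unless $depth;
                $out .= $text;
            },
            'text'
        ],
        # Comments, declarations and anything else pass through as-is.
        default_h => [
            sub {
                my ($text) = @_;
                $out .= $text if defined $text;
            },
            'text'
        ],
    );

    $parser->parse($html);
    $parser->eof;    # flush any text still buffered by the parser
    return $out;
}

print newlines_to_br("<p>a\nb</p><pre>c\nd</pre>\n"), "\n";
```

Tag names arrive lowercased, so `<PRE>` is handled as well, and markup nested inside the `<pre>` does not confuse the counter. The final newline of the input still becomes a `<br/>`, so the question of whether every newline should be converted remains yours to answer.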

Re^2: regex in form !regex->regex<-!regex
by forestcreature (Novice) on Feb 23, 2011 at 16:30 UTC

    You make a good case, many thanks for the help both of you!

    JJ