Re: regex in form !regex->regex<-!regex

As ELISHEVA rightly points out, this really is a job for an HTML parser. Doing it with a regex is generally not worth the effort it takes - probably the two most challenging aspects are the possibility of nested tags and the lack of support for variable-width look-behinds (see "Looking ahead and looking behind" in perlretut). You could get something like your desired behavior with:

#!/usr/bin/perl
use strict;
use warnings;

my $text = <<EOT;
<p>This is a line
with a break.</p><pre>This is a pre
with a break.</pre><p>This is a line
with a break.</p>
EOT

1 while $text =~ s{^((?:(?!<pre>).|<pre>(?:(?!</pre>).)*</pre>)*?)\n}{$1<br/>}is;
print $text;

which outputs

<p>This is a line<br/>with a break.</p><pre>This is a pre
with a break.</pre><p>This is a line<br/>with a break.</p><br/>

YAPE::Regex::Explain breaks this down as

The regular expression:

(?is-mx:^((?:(?!<pre>).|<pre>(?:(?!</pre>).)*</pre>)*?)\n)

matches as follows:
  
NODE                     EXPLANATION
----------------------------------------------------------------------
(?is-mx:                 group, but do not capture (case-insensitive)
                         (with . matching \n) (with ^ and $ matching
                         normally) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
        <pre>                    '<pre>'
----------------------------------------------------------------------
      )                        end of look-ahead
----------------------------------------------------------------------
      .                        any character
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      <pre>                    '<pre>'
----------------------------------------------------------------------
      (?:                      group, but do not capture (0 or more
                               times (matching the most amount
                               possible)):
----------------------------------------------------------------------
        (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
          </pre>                   '</pre>'
----------------------------------------------------------------------
        )                        end of look-ahead
----------------------------------------------------------------------
        .                        any character
----------------------------------------------------------------------
      )*                       end of grouping
----------------------------------------------------------------------
      </pre>                   '</pre>'
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  \n                       '\n' (newline)
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------

Note that you have to rerun the regex (as opposed to using the g modifier), since the match must always be anchored at the start of the string. Also note the trailing <br/>. That hints at a larger problem - are you absolutely certain you want to change all newlines in your input? They tend to show up in strange locations. It's all these corner cases that make a pre-built library so worthwhile. HTML::Parser has been tested and debugged for 15 years, not the 15 minutes one would like to spend.
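
For comparison, here is a rough sketch of the same transformation done with HTML::Parser (version 3 handler API) instead of a regex. The helper name nl2br_outside_pre is made up for illustration, and the code assumes you really do want every newline outside <pre> turned into <br/>. Treat it as a starting point, not a drop-in solution.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

# nl2br_outside_pre() is a hypothetical helper name, not part of HTML::Parser.
# It copies the document through unchanged, except that newlines in text
# outside any <pre> element become <br/>.
sub nl2br_outside_pre {
    my ($html) = @_;
    my $out    = '';
    my $in_pre = 0;    # number of currently open <pre> elements

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $text) = @_;
            $in_pre++ if $tag eq 'pre';    # tagname arrives lowercased
            $out .= $text;
        }, 'tagname, text' ],
        end_h => [ sub {
            my ($tag, $text) = @_;
            $in_pre-- if $tag eq 'pre' && $in_pre > 0;
            $out .= $text;
        }, 'tagname, text' ],
        text_h => [ sub {
            my ($text) = @_;
            $text =~ s{\n}{<br/>}g unless $in_pre;
            $out .= $text;
        }, 'text' ],
        # Comments, declarations and the like pass through untouched.
        default_h => [ sub { $out .= $_[0] }, 'text' ],
    );

    $parser->parse($html);
    $parser->eof;    # flush any text the parser is still buffering
    return $out;
}

my $text = <<EOT;
<p>This is a line
with a break.</p><pre>This is a pre
with a break.</pre><p>This is a line
with a break.</p>
EOT

print nl2br_outside_pre($text);
```

For the sample text this should give the same result as the regex loop, trailing <br/> included, since the final newline is ordinary text as far as the parser is concerned. The counter (rather than a simple flag) is there so a stray nested <pre> does not switch the substitution back on too early. Script and style contents are also delivered as text events, so they would need their own check if your input contains them.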

Re^2: regex in form !regex->regex<-!regex by forestcreature (Novice) on Feb 23, 2011 at 16:30 UTC
You make a good case. Many thanks for the help, both of you! JJ