in reply to Substitution inside tags, as 1 line

Purists are cringing at your apparent belief that <p> marks the end of a paragraph. It marks the beginning of a paragraph, which is then terminated by </p>. Your confusion is widespread and pardonable, because the terminal </p> is optional, and your orphan line at the beginning will usually be rendered exactly like a paragraph.

So here's how to do what you are trying to do:

s/(<pre>\n(?:[^\n]*<p>\n)*)([^>\n]*)\n(.*?<\/pre>)/$1$2<p>\n$3/ms

This assumes, as you do, that the opening <pre> is on a line of its own. I further assume that you start with no markup of any kind in your <pre> block. Each application of the substitution puts <p> at the end of the first line in the block that doesn't yet contain markup, so it has to be repeated until it stops matching.
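To see what that substitution does, here is a small sketch run against a made-up sample string. The variable names ($html, $line_re, $once, $all) are mine, chosen for illustration; the pattern itself is the one above, just stored with qr// so it can be reused.

```perl
use strict;
use warnings;

# Made-up sample: a <pre> block on its own lines, with no markup inside.
my $html = "<pre>\nfirst line\nsecond line\n</pre>\n";

# The pattern from the substitution above, compiled once for reuse.
my $line_re = qr/(<pre>\n(?:[^\n]*<p>\n)*)([^>\n]*)\n(.*?<\/pre>)/ms;

# One application tags only the first line that has no <p> yet.
my $once = $html;
$once =~ s/$line_re/$1$2<p>\n$3/;
print "After one pass:\n$once\n";

# Repeating until the pattern stops matching tags every line in the block.
my $all = $html;
1 while $all =~ s/$line_re/$1$2<p>\n$3/;
print "After repeating:\n$all";
```

The loop stops on its own because the only line left untagged is the one holding </pre>, and that line contains a ">", which the second capture group refuses to match.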

I think my attempt may be the kind of thing you're looking for, but you may find further problems with this approach. Before you spend too much more time on this regex, I'd advise you to either process the file line-by-line (as you're already thinking of doing), or better yet, drop regexes altogether and learn about parsers.
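For what it's worth, the line-by-line route might look something like the sketch below. The sub name tag_pre_lines and the sample text are made up for illustration, and it assumes, as before, that <pre> and </pre> each sit on a line of their own and that the block holds no other markup.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): append <p> to every line
# that sits between a <pre> line and a </pre> line.
sub tag_pre_lines {
    my ($text) = @_;
    my $in_pre = 0;    # true while we are between <pre> and </pre>
    my $out    = '';

    # split /^/ breaks the text into lines and keeps each trailing newline.
    for my $line (split /^/, $text) {
        if ($line =~ m{^<pre>$}) {
            $in_pre = 1;
        }
        elsif ($line =~ m{^</pre>$}) {
            $in_pre = 0;
        }
        elsif ($in_pre) {
            # Without /m, $ matches just before the line's final newline.
            $line =~ s/$/<p>/;
        }
        $out .= $line;
    }
    return $out;
}

my $sample = "intro\n<pre>\nfirst line\nsecond line\n</pre>\noutro\n";
print tag_pre_lines($sample);
```

A plain flag variable keeps the state obvious. Lines outside the block pass through untouched, and there is no backtracking to reason about.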

Re^2: Substitution inside tags, as 1 line
by Anonymous Monk on Oct 14, 2008 at 07:14 UTC
    Both m and s options on s///?
    e  Evaluate the right side as an expression.
    g  Replace globally, i.e., all occurrences.
    i  Do case-insensitive pattern matching.
    m  Treat string as multiple lines.
    o  Compile pattern only once.
    s  Treat string as single line.
      From Programming Perl, 3rd Edition, by Larry Wall et al., p. 153.
      /m Let ^ and $ match next to embedded \n.
      /s Let . match newlines and ignore deprecated $*.
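The two modifiers are independent of each other, so combining them is legal. A small sketch with a made-up two-line string shows what each one changes; the variable names are only for illustration. Note that the substitution in the parent post contains no ^ or $, so as far as I can tell /m makes no difference there, while /s is what lets its .*? run across newlines.

```perl
use strict;
use warnings;

my $text = "one\ntwo\n";

# Without /m, ^ anchors only at the very start of the string.
my @plain = $text =~ /^(\w+)/g;
# With /m, ^ also matches just after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;

# Without /s, . will not match a newline, so the pattern cannot span lines.
my $dot_plain  = $text =~ /one.two/  ? 1 : 0;
my $dot_single = $text =~ /one.two/s ? 1 : 0;

# This pattern needs both: /m so ^ can match at "two", /s so . can match
# the final newline before \z.
my $both = $text =~ /^two.\z/ms ? 1 : 0;

print "plain: @plain\n";
print "multi: @multi\n";
print "dot without /s: $dot_plain, with /s: $dot_single, /ms together: $both\n";
```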
Re^2: Substitution inside tags, as 1 line
by tel2 (Pilgrim) on Oct 14, 2008 at 08:44 UTC
    Thanks Narveson,

    Nice work!

    So would your full answer be:
    - To put that in a while loop, and
    - Add the <pre> & </pre> tag removal code, like this:

    perl -0 -pe '1 while (s/(<pre>\n(?:[^\n]*<p>\n)*)([^>\n]*)\n(.*?<\/pre>)/$1$2<p>\n$3/ms);s/<\/?pre>//g' htmlfile
    ?

      My full answer would be:

      Perhaps you'll manage to get this to work, but really, regexes, wonderful as they are, are the wrong tool here. I offered a bit of code in the spirit of "Don't you see how hairy this is going to have to be?"

      Parse your HTML. wfsp has been kind enough to furnish details.