tel2 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I have an HTML text file in this kind of format:

<html>
...etc...
<pre>
Line 1
Line 2
...etc...
Line n
</pre>
...etc...
</html>

I'd like to put a <p> tag at the end of each line which appears between the <pre> & </pre> tags, and finally remove the <pre> & </pre> tags, to give me this:

<html>
...etc...
Line 1<p>
Line 2<p>
...etc...<p>
Line n<p>
...etc...
</html>

I guess the simple and efficient way to do that would be to process the file line-by-line (and I might end up using such code), but first I'd like to see how it could be done treating the whole file as a single line. Here's the code I tried:

perl -0 -pe '1 while (s/(<pre>\n.*?)\n(.*<\/pre>)/$1<p>$2/ms);s/<\/?pre>//g' htmlfile
That works, except it strips the \n chars out like this of course:
<html>
...etc...
<pre>
Line 1<p>Line 2<p>...etc...<p>Line n<p></pre>
...etc...
</html>

So I tried this:

perl -0 -pe '1 while (s/(<pre>\n.*?)\n(.*<\/pre>)/$1<p>\n$2/ms);s/<\/?pre>//g' htmlfile
But that loops infinitely because it keeps on matching "Line 1", which becomes "Line 1<p><p><p>...".

How can I concisely write such code, processing htmlfile as a single line?

Thanks...Terry
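One way to treat the whole file as a single string without the runaway loop is to nest one substitution inside another using the /e modifier. The following is only a minimal sketch: the helper name pre_to_p is hypothetical, and it assumes, as the question does, that <pre> sits on a line of its own and that the block contains no other markup.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the whole file as one string.
sub pre_to_p {
    my ($html) = @_;

    # Outer s///: find each <pre>...</pre> block (/s lets . cross newlines).
    # With /e the replacement is code: the inner s/// appends <p> to every
    # line of the captured body, which is returned without the <pre> tags.
    $html =~ s{<pre>\n(.*?)</pre>\n?}{
        my $body = $1;
        $body =~ s/\n/<p>\n/g;
        $body;
    }gse;

    return $html;
}

print pre_to_p("<html>\n...etc...\n<pre>\nLine 1\nLine 2\nLine n\n</pre>\n...etc...\n</html>\n");
```

Each block is visited exactly once, so there is nothing to re-match and no loop is needed. On the command line the same idea could be run with perl -0777 -pe '...' htmlfile, where -0777 is perl's slurp switch.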

Replies are listed 'Best First'.
Re: Substitution inside tags, as 1 line
by Narveson (Chaplain) on Oct 14, 2008 at 04:38 UTC

    Purists are cringing at your apparent belief that <p> marks the end of a paragraph. It marks the beginning of a paragraph, which is then terminated by </p>. Your confusion is widespread and pardonable, because the terminal </p> is optional, and your orphan line at the beginning will usually be rendered exactly like a paragraph.

    So here's how to do what you are trying to do:

    s/(<pre>\n(?:[^\n]*<p>\n)*)([^>\n]*)\n(.*?<\/pre>)/$1$2<p>\n$3/ms

    This assumes, as you do, that the opening <pre> is on a line of its own. I further assume that you start with no markup of any kind in your <pre> block. The substitution puts <p> at the end of each line that doesn't yet contain markup.
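To see why this version terminates when repeated, here is a small sketch that runs the substitution in a loop on a sample string and then strips the <pre> tags with the same s/<\/?pre>//g used in the question. The sub name mark_pre_lines is hypothetical, and the sample input assumes a <pre> block with no other markup inside it.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution quoted above.
sub mark_pre_lines {
    my ($html) = @_;

    # $1 swallows <pre> plus every line already ending in <p>, so each pass
    # tags the first line that has not been tagged yet. Once the only line
    # left is </pre> itself, [^>\n]* cannot reach a newline and the loop ends.
    1 while $html =~ s/(<pre>\n(?:[^\n]*<p>\n)*)([^>\n]*)\n(.*?<\/pre>)/$1$2<p>\n$3/ms;

    # Remove the tags themselves; the lines they occupied are left empty.
    $html =~ s/<\/?pre>//g;

    return $html;
}

print mark_pre_lines("<html>\n<pre>\nLine 1\nLine 2\n</pre>\n</html>\n");
```

The two blank lines in the output are where the <pre> and </pre> lines used to be.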

    I think my attempt may be the kind of thing you're looking for, but you may find further problems with this approach. Before you spend too much more time on this regex, I'd advise you to either process the file line-by-line (as you're already thinking of doing), or better yet, drop regexes altogether and learn about parsers.

      Both m and s options on s///?
      e  Evaluate the right side as an expression.
      g  Replace globally, i.e., all occurrences.
      i  Do case-insensitive pattern matching.
      m  Treat string as multiple lines.
      o  Compile pattern only once.
      s  Treat string as single line.
        From Programming Perl, 3rd Edition, by Larry Wall, etc, P153.
        /m  Let ^ and $ match next to embedded \n.
        /s  Let . match newlines and ignore deprecated $*.
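The two modifiers are independent, so combining them is legal: /m changes only ^ and $, and /s changes only the dot. A short sketch of the difference on a made-up two-line string (the variable names are illustrative only):

```perl
use strict;
use warnings;

my $text = "Line 1\nLine 2\n";

# Without /m, ^ anchors only at the very start of the string.
my @plain = $text =~ /^(Line \d)/g;

# With /m, ^ also matches just after each embedded newline.
my @multi = $text =~ /^(Line \d)/mg;

# Without /s, . stops at a newline; with /s it matches across it.
my ($short) = $text =~ /(Line 1.*)/;
my ($long)  = $text =~ /(Line 1.*)/s;

printf "plain=%d multi=%d\n", scalar @plain, scalar @multi;
printf "short=%d chars, long=%d chars\n", length $short, length $long;
```

In the substitution discussed above there is no ^ or $, so /m has no effect there; it is /s that lets the trailing .*? run across lines to reach </pre>.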
      Thanks Narveson,

      Nice work!

      So would your full answer be:
      - To put that in a while loop, and
      - Add the <pre> & </pre> tag removal code, like this:

      perl -0 -pe '1 while (s/(<pre>\n(?:[^\n]*<p>\n)*)([^>\n]*)\n(.*?<\/pre>)/$1$2<p>\n$3/ms);s/<\/?pre>//g' htmlfile
      ?

        My full answer would be:

        Perhaps you'll manage to get this to work, but really, regexes, wonderful as they are, are the wrong tool here. I offered a bit of code in the spirit of "Don't you see how hairy this is going to have to be?"

        Parse your HTML. wfsp has been kind enough to furnish details.

Re: Substitution inside tags, as 1 line
by NetWallah (Canon) on Oct 14, 2008 at 06:30 UTC
    Try using the flip-flop operator:
    perl -pe 'm|<pre>|...m|</pre>| and $_.="<p/>"' < your-html-file
    Output :
    <html>
    ...etc...
    <pre>
    <p/>Line 1
    <p/>Line 2
    <p/>...etc...
    <p/>Line n
    <p/></pre>
    <p/>...etc...
    </html>

         Have you been high today? I see the nuns are gay! My brother yelled to me...I love you inside Ed - Benny Lava, by Buffalax

      Nice, but why do you use <p/>? That's equivalent to <p></p>, thus inserting an empty paragraph before each line, which sounds rather nonsensical to me.

      If you try to be correct about p nesting, then write $_ = "<p>$_</p>" instead.

      Or use <br /> instead.
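The flip-flop can also tell you where you are inside the range: in scalar context it returns a sequence number starting at 1, and the last number has "E0" appended. A rough sketch that uses this to wrap only the lines between the tags and drop the tag lines themselves; the sub name wrap_pre_lines is hypothetical, and it assumes <pre> and </pre> each sit on their own line.

```perl
use strict;
use warnings;

# Hypothetical helper: processes the text line by line.
sub wrap_pre_lines {
    my ($html) = @_;
    my $out = '';

    for (split /\n/, $html) {
        # Empty string outside the block; 1, 2, ... inside it, with "E0"
        # appended on the line that matches </pre>.
        my $seq = /<pre>/ .. /<\/pre>/;

        if (!$seq) {
            $out .= "$_\n";            # outside <pre>: copy unchanged
        }
        elsif ($seq == 1 or $seq =~ /E0$/) {
            next;                      # the <pre> or </pre> line: drop it
        }
        else {
            $out .= "<p>$_</p>\n";     # inside: wrap the newline-free line
        }
    }

    return $out;
}

print wrap_pre_lines("<html>\n...etc...\n<pre>\nLine 1\nLine 2\n</pre>\n...etc...\n</html>\n");
```

Because split has already removed the newlines, the closing </p> lands on the same line as its text rather than on the next one.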

      That's a nice looking alternative, NetWallah. Not what I had in mind, but it has a certain elegance & simplicity.

      Would you then suggest I put the output of that through a substitution that removes the two "</?pre>\n</p>" matches, to finish off my requirements? How would you do it?

      Thanks.

        Slightly uglier, getting rid of the extra </p> around the <PRE>, but still readable:
        perl -pe 'm|<pre>|..m|</pre>| and {m|</?pre>| or $_=qq|<p>$_</p>|} ' < Your-file
        Also removed the unnecessary empty para (per mortiz), and kept it XHTML-compatible!


Re: Substitution inside tags, as 1 line
by wfsp (Abbot) on Oct 14, 2008 at 11:41 UTC
    ...as 1 line.
    How about 44? :-) This uses a parser to get the data you need and a template to put it all back together again.

    Over the top? Possibly. I have a particular aversion to having any HTML in my code, even more so in a regex. It almost always ends in tears. This way I have no HTML in the code at all (the source and the template would normally be in separate files). YMMV.

    #!/usr/local/bin/perl
    use strict;
    use warnings;

    use HTML::TokeParser::Simple;
    use HTML::Template;

    my $p = HTML::TokeParser::Simple->new(\get_html());

    my ($in_pre, $pre);
    while (my $t = $p->get_token){
      $in_pre++, next if $t->is_start_tag(q{pre});
      next unless $in_pre;
      last if $t->is_end_tag(q{pre});
      $pre .= $t->as_is;
    }

    my @lines = grep{/\S/} split /\n/, $pre;

    my $tmpl = HTML::Template->new(scalarref => \get_tmpl());
    my @loop = map{{line => $_}} @lines;
    $tmpl->param(loop => \@loop);
    print $tmpl->output;

    sub get_html{
      return <<HTML;
    <html>
    <pre>
    line 1
    line 2
    line 3
    </pre>
    </html>
    HTML
    }

    sub get_tmpl{
      return <<TMPL
    <html>
    <TMPL_LOOP loop>
    <p><TMPL_VAR line></p>
    </TMPL_LOOP>
    </html>
    TMPL
    }

    <html>
    <p>line 1</p>
    <p>line 2</p>
    <p>line 3</p>
    </html>
      Hi wfsp.

      I finally made time to test your solution, and thanks very much for your input. Nice work! While I don't think my situation warrants using your code, I may well use it in future if I have a more complex problem to deal with, and I appreciate the time you took to demonstrate this method.

      BTW: The single line processing requirement I gave was about the way I wanted to treat the htmlfile, rather than the number of lines of code.

      Thanks again.

Re: Substitution inside tags, as 1 line
by Perlbotics (Archbishop) on Oct 14, 2008 at 15:36 UTC

    ... or a naive one-liner state-machine ...

    perl -pe 'chomp; $s=!$s,next if s/^\s*<\/?pre>\s*$//i; $_="<p>$_</p>" if $s; $_.="\n";' <in >out
    ... makes ...
    in:            out:
    ----------------------------------
    <html>         <html>
    ...etc...      ...etc...
    <pre>          <p>Line 1</p>
    Line 1         <p>Line 2</p>
    Line 2         <p>...etc...</p>
    ...etc...      <p>Line n</p>
    Line n         ...etc...
    </pre>         </html>
    ...etc...
    </html>

      Weeks later...

      I like it, Perlbotics.

      Thanks for that.