Substitution inside tags, as 1 line

tel2 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I have an HTML text file in this kind of format:

<html>
...etc...
<pre>
Line 1
Line 2
...etc...
Line n
</pre>
...etc...
</html>

I'd like to put a <p> tag at the end of each line which appears between the <pre> & </pre> tags, and finally remove the <pre> & </pre> tags, to give me this:

<html>
...etc...
Line 1<p>
Line 2<p>
...etc...<p>
Line n<p>
...etc...
</html>

I guess the simple and efficient way to do that would be to process the file line-by-line (and I might end up using such code), but first I'd like to see how it could be done treating the whole file as a single line. Here's the code I tried:

perl -0 -pe '1 while (s/(<pre>\n.*?)\n(.*<\/pre>)/$1<p>$2/ms);s/<\/?pre>//g' htmlfile

That works, except it strips the \n chars out like this of course:

<html>
...etc...
<pre>
Line 1<p>Line 2<p>...etc...<p>Line n<p></pre>
...etc...
</html>

So I tried this:

perl -0 -pe '1 while (s/(<pre>\n.*?)\n(.*<\/pre>)/$1<p>\n$2/ms);s/<\/?pre>//g' htmlfile

But that loops infinitely because it keeps on matching "Line 1", which becomes "Line 1<p><p><p>...".

How can I concisely write such code, processing htmlfile as a single line?

Thanks...Terry


Re: Substitution inside tags, as 1 line by Narveson (Chaplain) on Oct 14, 2008 at 04:38 UTC
Purists are cringing at your apparent belief that `<p>` marks the end of a paragraph. It marks the beginning of a paragraph, which is then terminated by `</p>`. Your confusion is widespread and pardonable, because the terminal `</p>` is optional, and your orphan line at the beginning will usually be rendered exactly like a paragraph.

So here's how to do what you are trying to do:

s/(<pre>\n(?:[^\n]*<p>\n)*)([^>\n]*)\n(.*?<\/pre>)/$1$2<p>\n$3/ms

This assumes, as you do, that the opening `<pre>` is on a line of its own. I further assume that you start with no markup of any kind in your `<pre>` block. The substitution puts `<p>` at the end of each line that doesn't yet contain markup.

I think my attempt may be the kind of thing you're looking for, but you may find further problems with this approach. Before you spend too much more time on this regex, I'd advise you to either process the file line-by-line (as you're already thinking of doing), or better yet, drop regexes altogether and learn about parsers.
Re^2: Substitution inside tags, as 1 line by Anonymous Monk on Oct 14, 2008 at 07:14 UTC
Both m and s options on s///?

e   Evaluate the right side as an expression.
g   Replace globally, i.e., all occurrences.
i   Do case-insensitive pattern matching.
m   Treat string as multiple lines.
o   Compile pattern only once.
s   Treat string as single line.
Re^3: Substitution inside tags, as 1 line by tel2 (Pilgrim) on Oct 14, 2008 at 09:01 UTC
From Programming Perl, 3rd Edition, by Larry Wall et al., p. 153:

/m  Let ^ and $ match next to embedded \n.
/s  Let . match newlines and ignore deprecated $*.
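For what it's worth, the two modifiers are independent, so using both on one substitution is fine: `/m` only changes where `^` and `$` may match, and `/s` only changes what `.` may match. A minimal sketch (the sample string and the `count_matches` helper are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical sample text: three lines held in one string.
my $text = "one\ntwo\nthree\n";

# Count how many times a pattern matches in a string.
sub count_matches {
    my ($re, $str) = @_;
    my @hits = $str =~ /$re/g;
    return scalar @hits;
}

# Without /m, ^ anchors only at the very start of the string.
my $plain_starts = count_matches(qr/^\w+/,  $text);   # 1
# With /m, ^ also anchors right after each embedded newline.
my $multi_starts = count_matches(qr/^\w+/m, $text);   # 3

# Without /s, . stops at a newline; with /s it crosses newlines.
my ($no_s)   = $text =~ /(o.*e)/;    # "one"
my ($with_s) = $text =~ /(o.*e)/s;   # "one\ntwo\nthree"
```

In the one-liners in this thread it is `/s` that does the real work, since it lets `.*?` run across the lines of the `<pre>` block.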
Re^2: Substitution inside tags, as 1 line by tel2 (Pilgrim) on Oct 14, 2008 at 08:44 UTC
Thanks Narveson,

Nice work! So would your full answer be:
- To put that in a while loop, and
- Add the <pre> & </pre> tag removal code,
like this?

perl -0 -pe '1 while (s/(<pre>\n(?:[^\n]*<p>\n)*)([^>\n]*)\n(.*?<\/pre>)/$1$2<p>\n$3/ms);s/<\/?pre>//g' htmlfile
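A script-sized sketch of that loop, for anyone who wants to watch it terminate. The sub name `tag_pre_lines` and the sample input are made up for illustration. The pattern is written with explicit `*` quantifiers and with `/x` so that it can carry comments. The final substitution also eats the newline after each tag, which is an assumption on my part so that no blank lines are left behind. As above, it assumes `<pre>` sits on a line of its own and the block holds no other markup.

```perl
use strict;
use warnings;

# Tag one more untagged line per pass until none are left, then drop
# the <pre> and </pre> tags.
sub tag_pre_lines {
    my ($html) = @_;
    1 while $html =~ s{
        (<pre>\n(?:[^\n]*<p>\n)*)   # <pre> plus the lines already tagged
        ([^>\n]*)\n                 # next line with no '>' in it yet
        (.*?</pre>)                 # remainder of the block
    }{$1$2<p>\n$3}xs;
    $html =~ s{</?pre>\n?}{}g;      # assumption: remove the tag lines whole
    return $html;
}

# Hypothetical sample input in the shape described in the question.
my $sample = "<html>\n<pre>\nLine 1\nLine 2\nLine n\n</pre>\n</html>\n";
print tag_pre_lines($sample);
```

The loop stops because a tagged line ends in `>`, so it can only be absorbed by the first group, and once every line is tagged the second group has nothing left that is followed by a newline.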
Re^3: Substitution inside tags, as 1 line by Narveson (Chaplain) on Oct 14, 2008 at 14:06 UTC
My full answer would be: Perhaps you'll manage to get this to work, but really, regexes, wonderful as they are, are the wrong tool here. I offered a bit of code in the spirit of "Don't you see how hairy this is going to have to be?" Parse your HTML. wfsp has been kind enough to furnish details.
Re: Substitution inside tags, as 1 line by NetWallah (Canon) on Oct 14, 2008 at 06:30 UTC
Try using the flip-flop operator:

perl -pe 'm|<pre>|...m|</pre>| and $_.="<p/>"' < your-html-file

Output:

<html>
...etc...
<pre>
<p/>Line 1
<p/>Line 2
<p/>...etc...
<p/>Line n
<p/></pre>
<p/>...etc...
</html>

Have you been high today? I see the nuns are gay! My brother yelled to me...I love you inside Ed - Benny Lava, by Buffalax
Re^2: Substitution inside tags, as 1 line by moritz (Cardinal) on Oct 14, 2008 at 06:35 UTC
Nice, but why do you use `<p/>`? That's equivalent to `<p></p>`, thus inserting an empty paragraph before each line, which sounds rather nonsensical to me. If you try to be correct about p nesting, then write `$_ = "<p>$_</p>"` instead. Or use `<br />` instead.
Re^2: Substitution inside tags, as 1 line by tel2 (Pilgrim) on Oct 14, 2008 at 08:57 UTC
That's a nice looking alternative, NetWallah. Not what I had in mind, but it has a certain elegance & simplicity. Would you then suggest I put the output of that through a substitution that removes the two `"</?pre>\n</p>"` matches, to finish off my requirements? How would you do it? Thanks.
Re^3: Substitution inside tags, as 1 line by NetWallah (Canon) on Oct 14, 2008 at 21:48 UTC
Slightly uglier, getting rid of the extra </p> around the <PRE>, but still readable:

perl -pe 'm|<pre>|..m|</pre>| and {m|</?pre>| or $_=qq|<p>$_</p>|} ' < Your-file

Also removed the unnecessary empty para (per moritz), and kept it XHTML-compatible!
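Spelled out as a script, the flip-flop idea might look like the sketch below. The sub name `wrap_pre_lines` is made up for illustration, and two things differ from the one-liner by assumption: each line is stripped of its newline before wrapping, so `</p>` lands on the same line as its text, and the `<pre>` and `</pre>` lines are dropped from the output.

```perl
use strict;
use warnings;

# Wrap every line strictly between <pre> and </pre> in <p>...</p> and
# drop the tag lines themselves. Takes and returns a list of lines,
# each with its trailing newline.
sub wrap_pre_lines {
    my @out;
    for my $line (@_) {
        # In a boolean test the range operator acts as a flip-flop:
        # true from a line matching <pre> through one matching </pre>.
        if ($line =~ m|<pre>| .. $line =~ m|</pre>|) {
            next if $line =~ m|</?pre>|;      # skip the tag lines
            (my $text = $line) =~ s/\n\z//;   # copy without the newline
            push @out, "<p>$text</p>\n";
        }
        else {
            push @out, $line;                 # outside the block: as is
        }
    }
    return @out;
}

# Hypothetical sample input.
print wrap_pre_lines("<html>\n", "<pre>\n", "Line 1\n", "</pre>\n", "</html>\n");
```

Note that the flip-flop keeps its state between calls, so this sketch expects every `<pre>` in its input to be closed.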
Re^4: Substitution inside tags, as 1 line by tel2 (Pilgrim) on Oct 16, 2008 at 01:20 UTC
Re: Substitution inside tags, as 1 line by wfsp (Abbot) on Oct 14, 2008 at 11:41 UTC
...as 1 line. How about 44? :-)

This uses a parser to get the data you need and a template to put it all back together again. Over the top? Possibly. I have a particular aversion to having any HTML in my code, even more so in a regex. It almost always ends in tears. This way I have no HTML in the code at all (the source and the template would normally be in separate files). YMMV.

#!/usr/local/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;
use HTML::Template;

my $p = HTML::TokeParser::Simple->new(\get_html());
my ($in_pre, $pre);
while (my $t = $p->get_token){
    $in_pre++, next if $t->is_start_tag(q{pre});
    next unless $in_pre;
    last if $t->is_end_tag(q{pre});
    $pre .= $t->as_is;
}
my @lines = grep{/\S/} split /\n/, $pre;

my $tmpl = HTML::Template->new(scalarref => \get_tmpl());
my @loop = map{{line => $_}} @lines;
$tmpl->param(loop => \@loop);
print $tmpl->output;

sub get_html{
    return <<HTML;
<html>
<pre>
line 1
line 2
line 3
</pre>
</html>
HTML
}

sub get_tmpl{
    return <<TMPL
<html>
<TMPL_LOOP loop>
<p><TMPL_VAR line></p>
</TMPL_LOOP>
</html>
TMPL
}

<html>
<p>line 1</p>
<p>line 2</p>
<p>line 3</p>
</html>
Re^2: Substitution inside tags, as 1 line by tel2 (Pilgrim) on Nov 05, 2008 at 07:23 UTC
Hi wfsp. I finally made time to test your solution, and thanks very much for your input. Nice work! While I don't think my situation warrants using your code, I may well use it in future if I have a more complex problem to deal with, and I appreciate the time you took to demonstrate this method. BTW: The single line processing requirement I gave was about the way I wanted to treat the htmlfile, rather than the number of lines of code. Thanks again.
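Since the requirement was to treat the file as one string, one possible sketch is a nested substitution: an outer `s///e` finds each block and an inner `s///g` tags its lines in a single pass, so no `1 while` loop is needed and nothing can be matched twice. The sub name `pre_to_p` is made up for illustration, and the sketch assumes `<pre>` and `</pre>` each sit on a line of their own.

```perl
use strict;
use warnings;

# Rewrite every <pre>...</pre> block in a string holding the whole file:
# append <p> to each line of the block and drop the surrounding tags.
sub pre_to_p {
    my ($html) = @_;
    $html =~ s{<pre>\n(.*?)</pre>\n}{
        my $block = $1;           # copy before the inner match resets $1
        $block =~ s/\n/<p>\n/g;   # one pass over the block, no looping
        $block;
    }gse;
    return $html;
}

# Slurp the file named on the command line in one read. Guarded so that
# nothing is read when no file name is given.
if (@ARGV) {
    local $/;                     # undefined record separator = slurp
    print pre_to_p(scalar <>);
}
```

The same idea should fit on a command line as `perl -0777 -pe 's{<pre>\n(.*?)</pre>\n}{ (my $t = $1) =~ s/\n/<p>\n/g; $t }gse' htmlfile`, where `-0777` is the documented way to make `-p` read the whole file at once.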
Re: Substitution inside tags, as 1 line by Perlbotics (Archbishop) on Oct 14, 2008 at 15:36 UTC
... or a naive one-liner state-machine ...

perl -pe 'chomp; $s=!$s,next if s/^\s*<\/?pre>\s*$//i; $_="<p>$_</p>" if $s; $_.="\n";' <in >out

... makes ...

in:            out:
----------------------------------
<html>         <html>
...etc...      ...etc...
<pre>          <p>Line 1</p>
Line 1         <p>Line 2</p>
Line 2         <p>...etc...</p>
...etc...      <p>Line n</p>
Line n         ...etc...
</pre>         </html>
...etc...
</html>
Re^2: Substitution inside tags, as 1 line by tel2 (Pilgrim) on Nov 07, 2008 at 02:24 UTC
Weeks later... I like it, Perlbotics. Thanks for that.