huge multiline regex

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys.

I recently redid the template of my web site, and I have these huge articles that each need 30+ replacements. It'd take 10-15 hours to do it manually.

I want to write a small Perl script to replace this chunk of HTML

              <blockquote>
                <DIV class=code_box>
                  <DIV class=code_box_header><font size="2" face="Verdana, Arial, Helvetica, sans-serif">code</font></DIV>
                  <font size="2" face="Verdana, Arial, Helvetica, sans-serif">all other text and stuff goes here</font></DIV>
              </blockquote>

With

          <BLOCKQUOTE>
            <p class="style2">all other text and stuff goes here</p>
          </BLOCKQUOTE>

The thing is, I need to capture the stuff inside the 2nd font tag and wrap it in the blockquote and p class shown above.

I can open up my 300+ HTML files and slurp them up just fine, but my regex skills aren't good enough to match across multiple lines while capturing only the text/code in the second font tag (there will always be exactly 2 if it matches).


Replies are listed 'Best First'.
Re: huge multiline regex by wfsp (Abbot) on Jun 23, 2006 at 03:42 UTC
This uses HTML::TokeParser::Simple. It finds the text within the second font tag within a blockquote.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    # Slurp the sample HTML from the DATA section.
    my $html;
    {
        local $/;
        $html = <DATA>;
    }

    my $p = HTML::TokeParser::Simple->new(\$html);
    my (@content, $in_bq);
    my $font_tag = 0;

    while (my $t = $p->get_token){
        # Start tracking once a blockquote opens.
        $in_bq++, next if $t->is_start_tag('blockquote');
        next unless $in_bq;
        # Count font start tags inside the blockquote.
        $font_tag++, next if $t->is_start_tag('font');
        # Keep the text that directly follows the second font tag.
        push @content, $t->as_is if $t->is_text and $font_tag == 2;
        # Reset the state for the next blockquote.
        ($in_bq, $font_tag) = (0,0) if $font_tag == 2;
    }
    print "@content";

    __DATA__
    <blockquote>
      <DIV class=code_box>
        <DIV class=code_box_header>
          <font size="2" face="Verdana, Arial, Helvetica, sans-serif">
            code
          </font>
        </DIV>
        <font size="2" face="Verdana, Arial, Helvetica, sans-serif">
          all other text and stuff goes here
        </font>
      </DIV>
    </blockquote>
    <blockquote>
      <DIV class=code_box>
        <DIV class=code_box_header>
          <font size="2" face="Verdana, Arial, Helvetica, sans-serif">
            code
          </font>
        </DIV>
        <font size="2" face="Verdana, Arial, Helvetica, sans-serif">
          all other text and stuff goes here
        </font>
      </DIV>
    </blockquote>
Re: huge multiline regex by davido (Cardinal) on Jun 23, 2006 at 02:18 UTC
There are too many variables involved with a free-form markup such as HTML: white-space can fall in arbitrary places, including within tags; tag attributes can change; and even the markup can change without altering the intent of the underlying text. While regular expressions are great for pattern matching, what you're doing goes beyond pattern matching into markup parsing. Regular expressions might comprise a portion of a full-fledged markup parser, but they're not usually a complete solution.

You really ought to be using something more robust than a fragile regular expression approach. HTML::TokeParser and HTML::Parser are two possible alternatives, both of which can handle the intricate nuances of HTML. Regular expressions that handle all the possibilities are difficult to construct correctly, and fragile. An HTML parser is a more suitable tool for the job.

Dave
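To make the fragility concrete, here is a minimal sketch using only core Perl. The pattern and the four sample strings are made up for illustration; they are not taken from the original poster's files. Each string is markup a browser would render the same way, yet a pattern written against one exact layout only recognizes the first.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A pattern written against one exact layout of the opening tag.
my $strict = qr{<font size="2">(.*?)</font>};

# Hypothetical variants that a browser would treat the same way.
my @variants = (
    '<font size="2">text</font>',               # the layout the pattern expects
    '<FONT SIZE="2">text</FONT>',               # different case
    "<font\n  size='2'>text</font>",            # line break in the tag, single quotes
    '<font face="Arial" size="2">text</font>',  # another attribute comes first
);

my $matched = 0;
for my $html (@variants) {
    my $ok = $html =~ $strict;
    $matched++ if $ok;
    (my $shown = $html) =~ s/\s+/ /g;   # collapse whitespace for display only
    printf "%-5s %s\n", ($ok ? 'match' : 'miss'), $shown;
}
print "$matched of ", scalar(@variants), " variants matched\n";
```

Each miss can be patched with another modifier or alternation, but the pattern grows with every case, which is the maintenance burden a parser avoids.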
Re: huge multiline regex by GrandFather (Saint) on Jun 23, 2006 at 02:28 UTC
Don't use regexen, use XML::Twig if your stuff is XHTML compliant or use something like HTML::TreeBuilder otherwise. If you are interested in the TreeBuilder route take a look at "Re^3: regex for search and replace of words in HTML" and the replies to "How do I perform a global substitute in an HTML::Element" for some ideas.

DWIM is Perl's answer to Gödel
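A rough sketch of what the TreeBuilder route might look like for this particular replacement follows. It assumes HTML::TreeBuilder is installed from CPAN, that every block to convert is a blockquote wrapping a `code_box` div with exactly two font tags (as the question states), and it uses a shortened, hypothetical sample document in place of a real file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;   # CPAN module, not part of core Perl

# Hypothetical sample standing in for one slurped file.
my $html = <<'HTML';
<html><body>
<blockquote>
  <DIV class=code_box>
    <DIV class=code_box_header><font size="2">code</font></DIV>
    <font size="2">all other text and stuff goes here</font></DIV>
</blockquote>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

for my $bq ($tree->look_down(_tag => 'blockquote')) {
    # Only touch blockquotes that wrap a code_box div.
    next unless $bq->look_down(_tag => 'div', class => 'code_box');

    my @fonts = $bq->look_down(_tag => 'font');
    next unless @fonts == 2;

    # Move the children of the second font tag into a new <p class="style2">.
    my $p = HTML::Element->new('p', class => 'style2');
    $p->push_content($fonts[1]->detach_content);

    # Replace everything inside the blockquote with the new paragraph.
    $bq->delete_content;
    $bq->push_content($p);
}

my $out = $tree->as_HTML(undef, '  ', {});
print $out, "\n";
$tree->delete;
```

Note that TreeBuilder re-serializes the whole document, so tag case, whitespace, and implied elements such as `<head>` may differ from the input; whether that is acceptable for 300+ files is worth checking on a copy first.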
Re: huge multiline regex by rsriram (Hermit) on Jun 23, 2006 at 06:05 UTC
Try this one after removing all the carriage returns in the HTML file. Assuming that all the contents are stored in `$file`:

    $file =~ s/<blockquote><DIV ([^>]+)><DIV ([^>]+)><font ([^>]+)>code<\/font><\/DIV><font ([^>]+)>(.+)<\/font><\/DIV><\/blockquote>/<BLOCKQUOTE><p class="style2">$5<\/p><\/BLOCKQUOTE>/g;

Sriram
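For anyone who still wants the regex route, here is one possible variation on the substitution above, offered as a sketch rather than a drop-in fix. It assumes `$file` holds one slurped file, as in the reply; the sample HTML is hypothetical. The differences are `/s` so `.` can cross line breaks (no need to strip them first), `/i` for tag case, `\s*` between tags to tolerate indentation, and non-greedy `.*?` so that two blocks in one file are not swallowed as a single match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample; $file stands in for one slurped HTML file.
my $file = <<'HTML';
<blockquote>
  <DIV class=code_box>
    <DIV class=code_box_header><font size="2" face="Verdana, Arial, Helvetica, sans-serif">code</font></DIV>
    <font size="2" face="Verdana, Arial, Helvetica, sans-serif">first body</font></DIV>
</blockquote>
<p>unrelated</p>
<blockquote>
  <DIV class=code_box>
    <DIV class=code_box_header><font size="2" face="Verdana, Arial, Helvetica, sans-serif">code</font></DIV>
    <font size="2" face="Verdana, Arial, Helvetica, sans-serif">second
body</font></DIV>
</blockquote>
HTML

# s///g returns the number of substitutions made.
my $count = $file =~ s{
    <blockquote\b[^>]*> \s*
    <div\b[^>]*>        \s*
    <div\b[^>]*>        \s*
    <font\b[^>]*> .*? </font> \s*    # first font tag: the "code" header, dropped
    </div>              \s*
    <font\b[^>]*> (.*?) </font> \s*  # second font tag: the text to keep
    </div>              \s*
    </blockquote>
}{<BLOCKQUOTE>\n  <p class="style2">$1</p>\n</BLOCKQUOTE>}gisx;

print "replaced $count block(s)\n";
print $file;
```

This is still a regex and will misfire on markup that departs from the expected shape, for example a blockquote that contains only one font tag, so the parser-based replies remain the safer choice.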