How to use a regex to parse html form tags

monkeriffic has asked for the wisdom of the Perl Monks concerning the following question:

I have no problems eating cereal...after it softens. Why is replacing a simple string so hard then? In other areas of my life, like eating oatmeal and getting dressed, I have no real problems. Some might even say I am a savant.

But I am just beginning Perl, and things I think are easy turn out not to be. Now, (said in Scarface voice) Let me introduce you to my lil' friend!

My task is sooo deceptively simple: Just replace a simple string with another string. How hard could that be?

My data file is here: http://home.comcast.net/~tankomail/preg.htm And a sample is at the very bottom of this post. I just want to replace /<form[.*]?*\/form>/ with the word "block".

Basically I just want to replace all <form> </form> fields and everything in between with nothing, but in testing, I wanted to see my work so I chose the word "block" as a good simple substitute which I could then replace with nothing.

Way below is my base code. But here, just under, is the line pulled from the base code that seems to be the issue:

$orgtext = Whey;  # this one right here 
$newtext = Popcorn;

The above works. I reduced it to its simplest form as a sanity check. Then I tried:

$orgtext = /[Ww]hey/;  # this one right here
$newtext = Popcorn;

But beyond the most primitive replacement, I invariably get:

Use of uninitialized value in pattern match (m//) at C:\russ\scripts\_Master_Snippets\clean_2_input_output_file.pl line 9.

Eventually I want to try:

$orgtext = /<form[.*]?*\/form>/;  # this one right here
$newtext = block;

But I can't get past the starting blocks. I know this code works in general, but my modifications seem to break it.

I also tried some while (<$intext>) variations, even removing the undef $/ slurp line, so that the intext would receive the data line by line - but no luck anywhere. I have spent quite a bit of research time trying various things - but apparently it's not a trivial task.

Any suggestions as to:

1.) Is my basic model okay, slurping the whole file into a variable? or 2.) Should I use a while <> structure?

And even when I do get the simple Whey replaced with Popcorn - it only does the first instance, basically, I am guessing, because there is no iterative code in this script. And the only iterative examples I've seen are not with a whole file in one "intext" variable, but always with a while <> structure.

Your input and examples are GREATLY appreciated because the red spot on my banging against the cubicle wall head is growing.

L, Sam

---------------------------

Here is my base code.

$infile = 'C:\russ\weights\preg.htm';
$outfile = 'C:\russ\weights\preg_clean.htm';

# No, I am not pregnant, but I am helping a pregnant woman out! No...not just helping her get
# her start either :)

$orgtext = Whey;
$newtext = Popcorn;

undef $/; #slurp mode, read files in a whole

open IN, $infile or die $!;
$intext = <IN>;
close IN;

$intext =~ s/$orgtext/$newtext/ms;
# the ms is for coping correctly with newlines (that can easily appear in a binary).

open OUT, ">$outfile" or die $!;
print OUT $intext;
close OUT;

# replaces ALL occurrences of orgtext with newtext and places the number of occurrences in $count

--------data sample. link to complete data above


<table width="100%" border="0" align="center" cellpadding="0" cellspacing="0">
  <tr>
    <td width="24%" rowspan="2" valign="top"><table width="198" border="0" align="center" cellpadding="1" cellspacing="0">
        <tr>
          <td><div align="center"><img src="images/ls_logo.gif" width="192" height="91"></div></td>
        </tr>
        <tr>
          <td valign="top"><table width="200" border="0" align="center" cellpadding="2" cellspacing="3">
              <tr>
                <td width="197" valign="top"><table width="100%" border="0" align="center" cellpadding="0" cellspacing="0">
                    <tr>
                      <td><div align="center"><a href="all-products.html"><img src="2005-menu/all-prods.gif" name="all" width="177" height="33" border="0"></a></div></td>
                    </tr>
                    <tr>
                      <td><div align="center"><a href="vitamins-supplements.html"><img src="2005-menu/vits-supl.gif" name="vitamins" width="177" height="33" border="0"></a></div></td>
                    </tr>
                    <tr>
                      <td><div align="center"><a href="liquid-supplements.html"><img src="2005-menu/liquid-vit.gif" name="liquid" width="177" height="33" border="0"></a></div></td>
                    </tr>
                    <tr>
                      <td><div align="center"><a href="body-building.html"><img src="2005-menu/body-build.gif" name="bodybuild" width="177" height="33" border="0"></a></div></td>
                    </tr>
                    <tr>
                      <td><div align="center"><a href="weightloss.html"><img src="2005-menu/diet.gif" name="diet" width="177" height="33" border="0"></a></div></td>
                    </tr>
                    <tr>
                      <td><div align="center"><a href="body-essentials.html"><img src="2005-menu/body-ess.gif" name="bodyess" width="177" height="33" border="0"></a></div></td>
                    </tr>
                    <tr>
                      <td><div align="center"><a href="articles.html"><img src="2005-menu/articles.gif" alt="Articles of Interest" name="articles" width="177" height="33" border="0"></a></div></td>
                    </tr>
                  </table></td>
              </tr>
              <tr>
                <td><div align="center">

                  </div></td>
              </tr>
              <tr>
                <td><div align="center"> <form method=POST style="margin-bottom: 0" action="https://www.linkpointcart.net/cgi-bin/cart.cgi">
                            <input type=hidden name="ViewCart" value="ThreadsCart">
                            <input type=submit value="View Cart">
                          </form></div></td>
              </tr>
              <tr>
                <td><div align="center"><form method=POST style="margin-bottom: 0" action="https://www.linkpointcart.net/cgi-bin/cart.cgi">
    <input type=hidden name="CheckOut" value="Online">
    <input type=hidden name="CartID" value="ThreadsCart">
    <input type=submit value="Check Out">
</form></div></td>
              </tr>
              <tr>
                <td><table width="100%" border="0" cellspacing="0" cellpadding="0">
                    <tr>
                      <td><br><div align="center"><a href="catalog.html"><img src="2005-menu/catalog-banner.gif" width="196" height="50" border="0"></a></div></td>
                    </tr>
                  </table>
                  <div align="center"><font size="2" face="Arial, Helvetica, sans-serif"><strong><br>
                    We want to hear from you.<br>
                    Suggest a NEW PRODUCT!!<br>
                    <a href="suggest.htm">:: click here::</a></strong></font></div></td>
              </tr>
            </table></td>
        </tr>
      </table></td>
    <td width="76%" height="28" valign="top"><div align="right"><img src="2005-menu/top-image.gif" width="604" height="98" border="0" usemap="#Map"></div></td>
  </tr>
  <tr>
    <td valign="top"><br> <!-- InstanceBeginEditable name="content" -->
      <table width="90%" border="0" align="center" cellpadding="1" cellspacing="1">
        <tr>
          <td><table width="560" border="0" align="center" cellpadding="3" cellspacing="0">
              <tr>
                <td rowspan="2" valign="top"><div align="center"><img src="bottles/whey-chocolate-s.gif" width="102" height="150" border="0"><br>
                    <font color="#666666" size="1" face="Arial, Helvetica, sans-serif"></font></div></td>
                <td><div align="left"><font size="2" face="Arial, Helvetica, sans-serif"><strong><font size="3">Whey
                    Protein<br>
                    Chocolate 3.3 lbs.</font><br>
                    54 grams of protein per serving<br>
                    <br>
                    </strong></font><font face="Verdana, Arial, Helvetica, sans-serif"><strong><font size="3" face="Arial, Helvetica, sans-serif">$
                    39.99</font></strong></font></div></td>
                <td rowspan="2" valign="top"><div align="center"><img src="bottles/whey-vanilla-s.gif" width="102" height="150" border="0"><br>
                  </div></td>
                <td><div align="left"><font size="2" face="Arial, Helvetica, sans-serif"><strong><font size="3">Whey
                    Protein<br>
                    Vanilla </font><font size="2" face="Arial, Helvetica, sans-serif"><strong><font size="3">3.3
                    lbs.</font></strong></font><br>
                    54 grams of protein per serving.<br>
                    <br>
                    </strong></font><font face="Verdana, Arial, Helvetica, sans-serif"><strong><font size="3" face="Arial, Helvetica, sans-serif">$
                    39.99</font></strong></font><font size="2" face="Arial, Helvetica, sans-serif"><strong>
                    </strong></font></div></td>
              </tr>
              <tr>
                <td><form method="post" action="https://www.linkpointcart.net/cgi-bin/cart.cgi">
                    <table border="0" cellpadding="0" cellspacing="0">
                      <tr>
                        <td><font size="2" face="Arial, Helvetica, sans-serif">Quantity:</font></td>
                        <td><font face="Verdana, Arial, Helvetica, sans-serif">
                          <input type="text" name="VARQuantity" value="1" size="4" />
                          </font></td>
                      </tr>
                      <tr>
                        <td colspan="2" align="center"> <font face="Verdana, Arial, Helvetica, sans-serif">
                          <input type="hidden" name="VAR000" value="|" />
                          <input type="hidden" name="AddItem" value="ThreadsCart|Lifesource Labs - Whey Protein Powder Chocolate VAR000 $39.99|VARQuantity|||price5|||||||" />
                          <input name="submit" type="submit" value="Add To Cart" />
                          </font></td>
                      </tr>
                    </table>
                  </form></td>
                <td><form method="post" action="https://www.linkpointcart.net/cgi-bin/cart.cgi">
                    <table border="0" cellpadding="0" cellspacing="0">
                      <tr>
                        <td><font size="2" face="Arial, Helvetica, sans-serif">Quantity:</font></td>
                        <td><font face="Verdana, Arial, Helvetica, sans-serif">
                          <input type="text" name="VARQuantity2" value="1" size="4" />
                          </font></td>
                      </tr>
                      <tr>
                        <td colspan="2" align="center"> <font face="Verdana, Arial, Helvetica, sans-serif">
                          <input type="hidden" name="VAR000" value="|" />
                          <input type="hidden" name="AddItem" value="ThreadsCart|Lifesource Labs - Whey Protein Powder Vanilla VAR000 $39.99|VARQuantity|||price5|||||||" />
                          <input name="submit" type="submit" value="Add To Cart" />
                          </font></td>
                      </tr>
                    </table>
                  </form></td>

2006-10-14 Retitled by GrandFather, as per Monastery guidelines
Original title: 'I have no problems eating cereal...after it softens. Why is replacing a simple string so hard then?'


Replies are listed 'Best First'.
Re: How to use a regex to parse html form tags by GrandFather (Saint) on Oct 14, 2006 at 01:17 UTC
That ain't no simple string. That are HTML, and HTML ain't so simple. In fact, if you have anything else you want to do for the rest of the week, don't bother trying to write code to parse HTML. Use something like HTML::TreeBuilder instead.

However, even before that you should take a regex refresher course. I'd recommend that you start with perlretut, perlrequick, perlre and perlreref.

In the meantime, `$intext =~ s/$orgtext/$newtext/ms;` probably wants a /g to replace all the strings.

DWIM is Perl's answer to Gödel
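To illustrate the /g point, here is a minimal sketch. The sample string is made up for the demonstration and only stands in for the slurped file contents; the variable names other than `$intext` are likewise illustrative.

```perl
use strict;
use warnings;

# Hypothetical sample text standing in for the slurped file contents.
my $intext = "Whey Protein Chocolate\nWhey Protein Vanilla\nwhey-chocolate-s.gif\n";

# Without /g, only the first match is replaced.
(my $first_only = $intext) =~ s/Whey/Popcorn/;

# With /g, every match is replaced. The substitution returns the number
# of replacements it made (or a false value if there were none).
my $all   = $intext;
my $count = ($all =~ s/Whey/Popcorn/g);

print "first only:\n$first_only\n";
print "all ($count replaced):\n$all";
```

The match is case-sensitive here, so the lowercase "whey" in the image name is left alone; adding /i would catch it too.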
Re: How to use a regex to parse html form tags by Zaxo (Archbishop) on Oct 14, 2006 at 01:19 UTC
You ought to use something like HTML::Parser for that. It is notoriously difficult to decode html with regular expressions.

That said, your problem is limited enough that your approach will probably work once you tell the regex engine to substitute globally. Do that by using the /g modifier to the substitution, as well as /s. I'm not sure you need /m but I don't think it does any harm.

$intext =~ s/$orgtext/$newtext/gims;

I also made it ignore case in matching, just in case.

After Compline,
Zaxo
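Put together, the slurp-and-substitute approach might look like the sketch below. The helper name `replace_in_file` and the command-line handling are assumptions made for this example, not part of the original script; the pattern and replacement mirror the Whey/Popcorn test from the question.

```perl
use strict;
use warnings;

# Slurp $infile, replace every match of $pattern with $replacement,
# write the result to $outfile, and return the number of replacements.
sub replace_in_file {
    my ($infile, $outfile, $pattern, $replacement) = @_;

    open my $in, '<', $infile or die "Cannot read $infile: $!";
    my $text = do { local $/; <$in> };    # slurp mode, limited to this block
    close $in;

    # s///g returns the number of substitutions, or false if there were none.
    my $count = ($text =~ s/$pattern/$replacement/g) || 0;

    open my $out, '>', $outfile or die "Cannot write $outfile: $!";
    print {$out} $text;
    close $out or die "Cannot close $outfile: $!";

    return $count;
}

# Run only when both file names are given on the command line.
if (@ARGV == 2) {
    my $n = replace_in_file(@ARGV, qr/whey/i, 'Popcorn');
    print "Replaced $n occurrence(s)\n";
}
```

Using `local $/` inside a `do` block keeps slurp mode from leaking into the rest of the program, which a bare `undef $/;` at file level does not.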
Re: How to use a regex to parse html form tags by ammon (Sexton) on Oct 14, 2006 at 01:29 UTC
"1.) Is my basic model okay, slurping the whole file into a variable? or 2.) Should I use a while <> structure?"

That depends on the size of your file. If it's large, you might want to use the while() loop.

"And even when I do get the simple Whey replaced with Popcorn - it only does the first instance, basically, I am guessing, because there is no iterative code in this script."

The iterative code you're looking for is a `/g` on the end of your `s///`.

However, that answer isn't going to solve your problems. If you want to eat non-mushy cereal, you need to put your dentures in. Use perl's dentures at the top of your code:

use strict;
use warnings;

Finally, your eventual replacement pattern, `/<form[.*]?*\/form>/`, is flawed, and will not do what you appear to think it will do. The square brackets, `[]`, indicate a character class, so what you're matching is `<form`, followed by either a literal `.` or a literal `*`, or nothing, followed by `/form>`. The regex you are looking for(*) is `m{<form.*?</form>}`. Note the use of an alternative delimiter so the `/` doesn't need to be escaped.

(*) You are not looking for a regex; regexes are insufficient for parsing gobs of HTML/XML. The regex given above already has problems. :)

Cheers,
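The difference between a character class and a plain `.*?` can be checked with a small sketch. The HTML string is a made-up fragment for the demonstration, and the character-class pattern is written with a single valid quantifier so that it compiles.

```perl
use strict;
use warnings;

# Hypothetical fragment containing two forms on one line.
my $html = '<td><form method="post">stuff</form></td><td><form>more</form></td>';

# [.*] is a character class: it matches a literal "." or a literal "*",
# so this pattern cannot get past the attributes or the closing ">".
my $class_hits = () = $html =~ m{<form[.*]*</form>}g;

# .*? matches as few characters as possible, stopping at the next </form>.
my $lazy_hits = () = $html =~ m{<form.*?</form>}g;

# .* is greedy: it runs to the LAST </form>, swallowing both forms at once.
my $greedy_hits = () = $html =~ m{<form.*</form>}g;

print "character class: $class_hits\n";
print "non-greedy:      $lazy_hits\n";
print "greedy:          $greedy_hits\n";
```

The `my $n = () = ...` construct counts matches by assigning the match list to an empty list in scalar context.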
Re: How to use a regex to parse html form tags by bart (Canon) on Oct 14, 2006 at 07:19 UTC
"I just want to replace `/<form[.*]?*\/form>/` with the word "block""

Well, there's your first huge mistake (apart from the question whether this is a good idea). You're now looking for "<form", followed by any number of occurrences of either "." or "*", and ending in "/form>". Or, it would if you put in your quantifiers in the correct way:

/<form[.*]*?\/form>/

It makes sense that it finds nothing. Where does the idea come from to use square brackets, anyway? Uh, yes, "." is indeed a plain character in a character class. That might be an unexpected pitfall.

So, try again with

/<form.*?\/form>/s

The `/s` is to treat newlines as plain characters in `/./s`. (You put it after the third slash in `s/PAT/REPL/s`.)
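A short sketch of what /s changes, using a made-up multi-line form similar to the ones in the data sample:

```perl
use strict;
use warnings;

# Hypothetical form that spans several lines, as in the data sample.
my $html = "<td><form method=POST>\n  <input type=submit>\n</form></td>";

# Without /s, "." does not match a newline, so the pattern can never
# reach "/form>" and nothing is replaced.
(my $without_s = $html) =~ s/<form.*?\/form>/block/g;

# With /s, "." matches newlines too, and the whole form is replaced.
(my $with_s = $html) =~ s/<form.*?\/form>/block/gs;

print "without /s: $without_s\n";
print "with /s:    $with_s\n";
```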
Re: How to use a regex to parse html form tags by graff (Chancellor) on Oct 14, 2006 at 07:20 UTC
Well, whether or not your data and your intended edit are really as simple as you think they should be, some of your code gives me the impression that you are missing a few details about the syntax involved with regular expressions. In particular:

$orgtext = Whey;  # this one right here
$newtext = Popcorn;

"The above works. I reduced it to its simplest form as a sanity check. Then I tried:"

$orgtext = /[Ww]hey/;  # this one right here
$newtext = Popcorn;

The second statement for "$orgtext" is equivalent to:

$orgtext = ( $_ =~ /[Ww]hey/ );

If nothing had been assigned to $_ by the time your second $orgtext statement executed, you get the warning because perl is doing a regex match on $_, which is still undefined. Maybe that second example should have been something like this?

$orgtext = qr/[Ww]hey/;

As for what you "eventually want to try":

$orgtext = /<form[.*]?*\/form>/;  # this one right here
$newtext = block;

That assignment to $orgtext has not only the same problem cited above (use "qr/.../" instead?), but also problems with improper use of the square brackets, period, question mark and asterisks. I think what you intended was something like this:

$orgtext = qr/<form.*?\/form>/s;  # s is needed so that "." matches newline

But as others have pointed out, an HTML parsing module, once you get the hang of it, is probably a better approach. Next there's this in your "base code":

$intext =~ s/$orgtext/$newtext/ms;
# the ms is for coping correctly with newlines (that can easily appear in a binary).
...
# replaces ALL occurrences of orgtext with newtext and places the number of occurrences in $count

The comment about "that can easily appear in a binary" makes no sense, and the use of "m" on that substitution is actually pointless -- it would be relevant if you were anchoring the regex match with "^" or "$", because "m" alters the behavior of those anchors, but you're not using them. (The "whey" example doesn't use "." either, so the "s" is pointless as well in that case, but it's harmless, and I know you eventually expected to use ".")

And then, based on your closing comment (which refers to a "$count" variable that doesn't appear anywhere), it looks like you should have used the "g" modifier on that substitution, so that all occurrences of "$orgtext" would be changed to "$newtext".
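As a rough sketch of the `qr//` approach: the pattern is compiled once, stored in a variable, and interpolated into the substitution, where the /g modifier belongs. The sample string and the names `$out` and `$matched` are made up for the demonstration.

```perl
use strict;
use warnings;

# qr// compiles a pattern and stores it; it does not match against $_.
my $orgtext = qr/[Ww]hey/;
my $newtext = 'Popcorn';    # quoted, since barewords fail under "use strict"

# Hypothetical sample text.
my $intext = "Whey Protein, bottles/whey-vanilla-s.gif";

# /g goes on the substitution itself, not on qr//.
(my $out = $intext) =~ s/$orgtext/$newtext/g;
print "$out\n";

# By contrast, a bare /.../ is a match against $_ and yields true or false.
my $matched = do { local $_ = 'Whey'; /[Ww]hey/ ? 1 : 0 };
print "bare match against \$_ returned $matched\n";
```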
Re: How to use a regex to parse html form tags by fenLisesi (Priest) on Oct 14, 2006 at 10:37 UTC
I am not very knowledgeable on pregnancy-nutrition requirements, but the following may get you going with your removing-the-forms requirement:

use strict;
use warnings;
use WWW::Mechanize;
use HTML::TreeBuilder;

my $URL = q(http://home.comcast.net/~tankomail/preg.htm);

# fetch the page
my $mech = WWW::Mechanize->new();
$mech->get( $URL );

# build a tree from the HTML
my $tree = HTML::TreeBuilder->new();
$tree->parse( $mech->content() );
$tree->eof();    # tell the parser that the document is complete

# append a marker to each form's parent, then remove the form
my @forms = $tree->find('form');
foreach my $form (@forms) {
    my $parent = $form->parent();
    $parent->push_content("BLOCKED");
    $form->delete();
}
print $tree->as_HTML();
$tree->delete();
__END__

You may need another tool to tidy up the html that this produces.
Re: How to use a regex to parse html form tags by blazar (Canon) on Oct 14, 2006 at 10:25 UTC
"I have no problems eating cereal...after it softens. Why is replacing a simple string so hard then? In other areas of my life, like eating oatmeal and getting dressed, I have no real problems. Some might even say I am a savant."

Since you posted the very same question on clpmisc, it would be fair and polite to say so both there and in the Monastery, to help people avoid duplicate efforts. All in all this may be considered a form of "cross-media multiposting"...