elvenwonder has asked for the wisdom of the Perl Monks concerning the following question:
Hello monks!
I'm a new user to this site, and have only been perl-ing for about two months, so please bear with me. Big picture, I'm writing a script for work which will take data from our webpage and generate gnuplot bar graphs. Small picture, I've found that there are blank rows in some of the HTML tables (there is way too much red tape to change the FORTRAN program that generates the HTML), and I now need to change my regular expressions to catch those blank rows without catching single blank cells. I am not allowed to install modules (like HTML::TableExtract). I've pulled the HTML into text files and have regexes that pull out the necessary bits, but they get all fouled up on the occasional table that includes a blank row. Here is a sample of the text I am searching:
<TR VALIGN="TOP"><TD><FONT SIZE="-1"> * </FONT></TD>
<TD><FONT SIZE="-1"> MHS </FONT></TD>
<TD><FONT SIZE="-1">125370</FONT></TD>
<TD><FONT SIZE="-1">129114</FONT></TD>
<TD><FONT SIZE="-1">131645</FONT></TD>
<TD><FONT SIZE="-1">129546</FONT></TD>
<TD><FONT SIZE="-1">515675</FONT></TD></TR>
<TR VALIGN="TOP"><TD><FONT SIZE="-1"> * </FONT></TD>
<TD><FONT SIZE="-1"> AIRS </FONT></TD>
<TD><FONT SIZE="-1">626462</FONT></TD>
<TD><FONT SIZE="-1">567621</FONT></TD>
<TD><FONT SIZE="-1">614791</FONT></TD>
<TD><FONT SIZE="-1">574009</FONT></TD>
<TD><FONT SIZE="-1">2382883</FONT></TD></TR>
<TR VALIGN="TOP"><TD><FONT SIZE="-1"></FONT></TD>
<TD><FONT SIZE="-1"></FONT></TD>
<TD><FONT SIZE="-1"></FONT></TD>
<TD><FONT SIZE="-1"></FONT></TD>
<TD><FONT SIZE="-1"></FONT></TD>
<TD><FONT SIZE="-1"></FONT></TD>
<TD><FONT SIZE="-1"></FONT></TD></TR>
I want to catch the third of those chunks (the row in which every cell is blank). The following are some of the regexes I've tried, all of which (I think) should be equivalent for all intents and purposes. I apologize for their ugliness. None of them pulls out the section I need:
while (<FILE>) {
    if( m#(<TR VALIGN="TOP"><TD></FONT></TD>.*</TR>)#sg ){
        push(@search1, $+);
    }
    if( m#(<TR VALIGN="TOP"><TD><FONT SIZE="-1"></FONT></TD> <TD><FONT SIZE="-1"></FONT></TD> <TD><FONT SIZE="-1"></FONT></TD> <TD><FONT SIZE="-1"></FONT></TD> <TD><FONT SIZE="-1"></FONT></TD> <TD><FONT SIZE="-1"></FONT></TD> <TD><FONT SIZE="-1"></FONT></TD></TR>)#mg ){
        push(@search2, $+);
    }
    if( m#(<TR VALIGN="TOP"><TD><FONT SIZE="-1"></FONT></TD>\s*\n*\s*<TD><FONT SIZE="-1"></FONT></TD>\s*\n*\s*<TD><FONT SIZE="-1"></FONT></TD>\s*\n*\s*<TD><FONT SIZE="-1"></FONT></TD>\s*\n*\s*<TD><FONT SIZE="-1"></FONT></TD>\s*\n*\s*<TD><FONT SIZE="-1"></FONT></TD>\s*\n*\s*\n*\s*<TD><FONT SIZE="-1"></FONT></TD></TR>)#mg ){
        push(@search3, $+);
    }
    if( m#(<TR VALIGN="TOP"><TD><FONT SIZE="-1"></FONT></TD>.*<TD><FONT SIZE="-1"></FONT></TD>.*<TD><FONT SIZE="-1"></FONT></TD>.*<TD><FONT SIZE="-1"></FONT></TD>.*<TD><FONT SIZE="-1"></FONT></TD>.*<TD><FONT SIZE="-1"></FONT></TD>.*<TD><FONT SIZE="-1"></FONT></TD></TR>)#sg ){
        push(@search4, $+);
    }
}
I've tried variations on //m and //s, but it still doesn't catch. I would deeply appreciate any suggestions for a solution or revelations as to why I am wrong. Thanks in advance,
elvenwonder
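One possible explanation, offered as a guess rather than a diagnosis: `while (<FILE>)` hands the regex one line at a time, so if each cell sits on its own line (an assumption about the file layout), a pattern spanning a whole row never sees the whole row, whatever //m or //s flags are used. Below is a minimal sketch of matching against the entire text at once instead, using only core Perl. The sample data, `$html`, `$blank_cell`, and `@blank_rows` are hypothetical names chosen for illustration, and the commented-out `$path` stands in for the real filename.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the real file, with one cell per line
# (assumed layout). Rows are shortened to three cells to keep the sketch small.
my $html = <<'END_HTML';
<TR VALIGN="TOP"><TD><FONT SIZE="-1"> * </FONT></TD>
<TD><FONT SIZE="-1"> MHS </FONT></TD>
<TD><FONT SIZE="-1">125370</FONT></TD></TR>
<TR VALIGN="TOP"><TD><FONT SIZE="-1"></FONT></TD>
<TD><FONT SIZE="-1"></FONT></TD>
<TD><FONT SIZE="-1"></FONT></TD></TR>
<TR VALIGN="TOP"><TD><FONT SIZE="-1"> * </FONT></TD>
<TD><FONT SIZE="-1"></FONT></TD>
<TD><FONT SIZE="-1">42</FONT></TD></TR>
END_HTML

# With a real file, the whole text could be slurped into one string instead:
#   my $html = do { local $/; open my $fh, '<', $path or die $!; <$fh> };

# One blank cell: nothing but optional whitespace between the FONT tags.
my $blank_cell = qr{<TD><FONT SIZE="-1">\s*</FONT></TD>};

# A blank row is a TR whose cells are ALL blank, so a row containing a
# single blank cell among filled ones does not match.
my @blank_rows;
while ($html =~ m{(<TR VALIGN="TOP">(?:\s*$blank_cell)+\s*</TR>)}g) {
    push @blank_rows, $1;
}

printf "found %d blank row(s)\n", scalar @blank_rows;
```

Because the pattern starts at `<TR VALIGN="TOP">` and allows only blank cells and whitespace before `</TR>`, it skips the third sample row, which has just one empty cell. Slurping assumes the files are small enough to hold in memory.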