elvenwonder has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks!

I'm a new user to this site, and have only been perl-ing for about two months, so please bear with me. Big picture, I'm writing a script for work which will take data from our webpage and generate gnuplot bar graphs. Small picture, I've found that there are blank rows in some of the HTML tables (way too much red tape to implement changes to the FORTRAN program that makes the HTML), and now need to change my regular expressions to catch those blank rows without catching single blank cells. I am not allowed to install modules (like HTML::TableExtract). I've pulled the html code into text files, and have regexes that pull out the necessary bits, but they get all fouled up on the occasional table that includes a blank line. Here is a sample of the text which I am searching:

<TR VALIGN="TOP"><TD><FONT SIZE="-1"> * </FONT></TD>
<TD><FONT SIZE="-1"> MHS </FONT></TD>
<TD><FONT SIZE="-1">125370</FONT></TD>
<TD><FONT SIZE="-1">129114</FONT></TD>
<TD><FONT SIZE="-1">131645</FONT></TD>
<TD><FONT SIZE="-1">129546</FONT></TD>
<TD><FONT SIZE="-1">515675</FONT></TD></TR>
<TR VALIGN="TOP"><TD><FONT SIZE="-1"> * </FONT></TD>
<TD><FONT SIZE="-1"> AIRS </FONT></TD>
<TD><FONT SIZE="-1">626462</FONT></TD>
<TD><FONT SIZE="-1">567621</FONT></TD>
<TD><FONT SIZE="-1">614791</FONT></TD>
<TD><FONT SIZE="-1">574009</FONT></TD>
<TD><FONT SIZE="-1">2382883</FONT></TD></TR>
<TR VALIGN="TOP"><TD><FONT SIZE="-1"></FONT></TD>
<TD><FONT SIZE="-1"></FONT></TD>
<TD><FONT SIZE="-1"></FONT></TD>
<TD><FONT SIZE="-1"></FONT></TD>
<TD><FONT SIZE="-1"></FONT></TD>
<TD><FONT SIZE="-1"></FONT></TD>
<TD><FONT SIZE="-1"></FONT></TD></TR>

I want to catch the third of those chunks. The following are some of the regexes I've tried--all of which (I think) should be equivalent, for all intents and purposes. I apologize for their ugliness. But they don't pull out the section I need, so:

while (<FILE>) {
    if( m#(<TR VALIGN="TOP"><TD></FONT></TD>.*</TR>)#sg ){
        push(@search1, $+);
    }
    if( m#(<TR VALIGN="TOP"><TD><FONT SIZE="-1"></FONT></TD>
<TD><FONT SIZE="-1"></FONT></TD>
<TD><FONT SIZE="-1"></FONT></TD>
<TD><FONT SIZE="-1"></FONT></TD>
<TD><FONT SIZE="-1"></FONT></TD>
<TD><FONT SIZE="-1"></FONT></TD>
<TD><FONT SIZE="-1"></FONT></TD></TR>)#mg ){
        push(@search2, $+);
    }
    if( m#(<TR VALIGN="TOP"><TD><FONT SIZE="-1"></FONT></TD>\s*\n*\s*<TD><FONT SIZE="-1"></FONT></TD>\s*\n*\s*<TD><FONT SIZE="-1"></FONT></TD>\s*\n*\s*<TD><FONT SIZE="-1"></FONT></TD>\s*\n*\s*<TD><FONT SIZE="-1"></FONT></TD>\s*\n*\s*<TD><FONT SIZE="-1"></FONT></TD>\s*\n*\s*\n*\s*<TD><FONT SIZE="-1"></FONT></TD></TR>)#mg ){
        push(@search3, $+);
    }
    if( m#(<TR VALIGN="TOP"><TD><FONT SIZE="-1"></FONT></TD>.*<TD><FONT SIZE="-1"></FONT></TD>.*<TD><FONT SIZE="-1"></FONT></TD>.*<TD><FONT SIZE="-1"></FONT></TD>.*<TD><FONT SIZE="-1"></FONT></TD>.*<TD><FONT SIZE="-1"></FONT></TD>.*<TD><FONT SIZE="-1"></FONT></TD></TR>)#sg ){
        push(@search4, $+);
    }
}

I've tried variations on //m and //s, but it still doesn't catch. I would deeply appreciate any suggestions for a solution or revelations as to why I am wrong. Thanks in advance,

elvenwonder

Replies are listed 'Best First'.
Re: Specific Regex with Multilines (/s and /m): Why Doesn't This Work?
by ikegami (Patriarch) on Jul 18, 2011 at 17:20 UTC

    if (/.../g) makes no sense conceptually and can lead to really weird behaviour. Get rid of those "g".

    Some of your regex patterns contain newlines, yet the string against which you are matching contains at most one at the end.

    Your match operators that have /m don't have "^" or "$" in the pattern, making the /m completely useless.
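    The "weird behaviour" of /g inside an if can be seen in a small sketch. The sample string and array names below are made up for illustration. In scalar context, a /g match records where it stopped in pos() of the string, and the next /g match on that same string starts from there rather than from the beginning:

```perl
use strict;
use warnings;

# Hypothetical sample line; any string containing both words would do.
my $line = "blank row here";

my @with_g;
# Scalar-context /g: matches "row" and leaves pos($line) just after it.
push @with_g, 'row'   if $line =~ /row/g;
# Starts searching at pos($line), i.e. after "row", so "blank" is never seen.
push @with_g, 'blank' if $line =~ /blank/g;

my @without_g;
# Without /g every match starts at the beginning of the string.
push @without_g, 'row'   if $line =~ /row/;
push @without_g, 'blank' if $line =~ /blank/;

print "with /g:    @with_g\n";      # with /g:    row
print "without /g: @without_g\n";   # without /g: row blank
```

    In the original loop all four tests run against the same $_, so an earlier successful /g match can make a later one start in the middle of the line.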

      Thank you--I thought that because I wanted all of the matches (there are several in the full text to be searched), I needed /g. But I don't because I'm pushing each into an array individually, right?

        But I don't because I'm pushing each into an array individually, right?

        Depends. As long as you can only have one match per line, then yes, the following would do:

        while (<>) {
            if (/.../) {
                push @matches, ...;
            }
        }

        If you can have multiple matches per line, you'd need something like:

        while (<>) {
            while (/.../g) {
                push @matches, ...;
            }
        }

        It's moot, however, since you want to match text that spans multiple lines. You want something more like the following:

        my $file; { local $/; $file = <>; }  # Slurp file.
        while ($file =~ /.../g) {
            push @matches, ...;
        }
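        To make the slurp-then-loop outline concrete, here is a minimal sketch, not the original poster's final code. The sample data is shortened to three cells per row, it is read through an in-memory filehandle instead of a real file, and the $blank_row pattern is an assumption about what "blank row" means (every cell empty or whitespace only):

```perl
use strict;
use warnings;

# Hypothetical stand-in for the saved HTML file: one data row, one blank row.
my $html = join "\n",
    '<TR VALIGN="TOP"><TD><FONT SIZE="-1"> * </FONT></TD>',
    '<TD><FONT SIZE="-1"> MHS </FONT></TD>',
    '<TD><FONT SIZE="-1">125370</FONT></TD></TR>',
    '<TR VALIGN="TOP"><TD><FONT SIZE="-1"></FONT></TD>',
    '<TD><FONT SIZE="-1"></FONT></TD>',
    '<TD><FONT SIZE="-1"></FONT></TD></TR>',
    '';

# Read from the string as if it were a file (in-memory filehandle).
open my $fh, '<', \$html or die "Cannot open in-memory file: $!";
my $file;
{
    local $/;          # undef inside this block only: <$fh> returns everything
    $file = <$fh>;
}
close $fh;

# A row counts as blank when every cell holds nothing but whitespace.
# Under /x, literal spaces in the pattern must be escaped.
my $blank_row = qr{
    <TR\ VALIGN="TOP">
    (?: \s* <TD><FONT\ SIZE="-1"> \s* </FONT></TD> )+
    \s* </TR>
}x;

my @matches;
while ($file =~ /($blank_row)/g) {
    push @matches, $1;
}

printf "found %d blank row(s)\n", scalar @matches;
```

        Because \s also matches newlines, the pattern needs neither /s nor /m. A data row that merely contains one empty cell is not caught, since every cell up to the closing </TR> has to be empty.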
Re: Specific Regex with Multilines (/s and /m): Why Doesn't This Work?
by toolic (Bishop) on Jul 18, 2011 at 16:52 UTC

      Thanks for quick response!

      To the first: I'm not allowed to install them in my home directory, either. To the second: my reason is a version of "IT Reluctance"/"Manager Support". I'm most of the way through the script without the module, and so (unless for some reason this particular problem is entirely unsolvable otherwise) I'm going to do these few lines of code without it. I mentioned the modules because early in the project I read about the module, tried to get permission to use it, and could not.

      In the end, though, my frustration is over the logic problem. Even copy-pasting the code fails to find a match, and clearly I must be doing something wrong in order for that to be so.

Re: Specific Regex with Multilines (/s and /m): Why Doesn't This Work?
by mcrose (Beadle) on Jul 18, 2011 at 16:24 UTC
    Are you not allowed to install modules into the system @INC, or are you not even able to use modules installed to your user-specific home directory? If you can do the latter, you could just extract HTML::TableExtract to a @INC directory and specify it in your script with 'use lib', or use local::lib and cpanm to handle installation for you automatically.
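    For anyone who is allowed to keep modules in a private directory, the 'use lib' approach might look like the sketch below. The directory name ~/perl-lib is hypothetical, and the sketch only shows how the search path changes; it does not load the module itself:

```perl
use strict;
use warnings;

# Assumption: the module's .pm files were unpacked by hand under ~/perl-lib,
# e.g. ~/perl-lib/HTML/TableExtract.pm. The directory name is hypothetical.
use lib "$ENV{HOME}/perl-lib";

# The directory is now part of @INC, so a later "use HTML::TableExtract;"
# would look there before the system-wide locations.
print "$_\n" for @INC;
```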
Re: Specific Regex with Multilines (/s and /m): Why Doesn't This Work?
by jethro (Monsignor) on Jul 18, 2011 at 17:41 UTC

    Did you undefine the input record separator, i.e. "undef $/;"? If not, you would be trying to match a multiline regex (i.e., your example chunk has 7 lines) while only reading the file line by line. The match would then fail, because you are comparing a single line against a pattern that expects more than one line.

    Naturally, in that case you have to use //s. The while loop also has to change so that it loops for as long as the regex finds something; the file itself is read only once.

      If the HTML files are not so large as to create a RAM issue, read the entire file into a scalar and:
      $myHTML =~ s#<TR VALIGN="TOP">[\r\n]*(?:<TD><FONT SIZE="-1"></FONT></TD>[\r\n]*){7}</TR>##g ;
      Then process $myHTML.
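      As a rough check of that substitution, the sketch below applies it to a made-up string holding one shortened data row and one blank row of seven cells, then counts the rows that remain. The sample data and the $blank_cell and $rows names are assumptions for illustration only:

```perl
use strict;
use warnings;

# Hypothetical slurped content: a shortened data row, then a blank row of seven cells.
my $blank_cell = qq{<TD><FONT SIZE="-1"></FONT></TD>\n};
my $myHTML =
      qq{<TR VALIGN="TOP"><TD><FONT SIZE="-1"> * </FONT></TD>\n}
    . qq{<TD><FONT SIZE="-1"> AIRS </FONT></TD></TR>\n}
    . qq{<TR VALIGN="TOP">} . $blank_cell x 7 . qq{</TR>\n};

# Delete every row that consists of exactly seven empty cells.
$myHTML =~ s#<TR VALIGN="TOP">[\r\n]*(?:<TD><FONT SIZE="-1"></FONT></TD>[\r\n]*){7}</TR>##g;

# Count the rows that are left; "() =" forces list context for the /g match.
my $rows = () = $myHTML =~ /<TR\b/g;
print "rows left: $rows\n";
```

      Note that [\r\n]* only tolerates line breaks between cells. If the real files also have spaces or tabs between the tags, \s* would be the more forgiving choice.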

      Ah, the undef $/; is what I was missing on the multiline search. The problem now is that I need to return it to its default behavior in order to perform the rest of my regular expressions properly. I find it frustrating that Google won't allow me to search for "$/". Thank you jethro (et al).

        local $/; within the appropriate scope is almost always more useful than undef $/;.
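        A small sketch of why local helps here: the change to $/ is undone automatically when the enclosing block ends, so later line-by-line reads behave as usual. The sample text and variable names are made up, and an in-memory filehandle stands in for the real file:

```perl
use strict;
use warnings;

# Hypothetical two-line input, read through an in-memory filehandle.
my $text = "first line\nsecond line\n";

open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my $whole = do {
    local $/;      # slurp mode, but only until the end of this block
    <$fh>;
};
close $fh;

# Outside the block $/ is back to its default, so reads are line by line again.
open $fh, '<', \$text or die "Cannot open in-memory file: $!";
my @lines = <$fh>;
close $fh;

printf "slurped %d characters, then read %d lines\n", length $whole, scalar @lines;
```

        With a plain undef $/; the second read would also return the whole text as a single element, unless $/ were set back to "\n" by hand.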
