deryni has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to pull some data out of an HTML file.

Before I get to the specific problem I would like to say that yes, I realize this is far from optimal in more ways than one.
I have read Death to Dot Star! and realize that this regex is probably horrendously inefficient. I have also been advised that HTML::TableExtract is a better way to get data out of an HTML table than a regex.

With all that in mind I would like to ask for help with this problem.
/^<(?:[Tt][Rr]).*?>(\d{5}).*?>(\d{2}).*?>(\d{3}).*?>(\d{3}).*?>(\d{2}).*?>(?:[&\w]).*?>(\w+(?:(?:[\s\w|&]+)?)*).*?>\s(\d).*?>(\w*?\d(?:[,\d\*]?)*)((?:[\w\d,]?)+).*?>(\w(?:(?:[\w\d-])?)*).*?<\/[tT][rR]>(<.*)?/

I am using that regex to pull the information out of a webpage, with the following line format (all newlines are mine to ease readability).

<tr><td width="0" align="center"><font face="Arial" size="2">5 Digits </font></td>
<td width="0" align="center"><font face="Arial" size="2">2 Digits </font></td>
<td width="0" align="center"><font face="Arial" size="2">3 Digits </font></td>
<td width="0" align="center"><font face="Arial" size="2">3 Digits </font></td>
<td width="0" align="center"><font face="Arial" size="2">2 Digits </font></td>
<td width="0" align="center"><font face="Arial" size="2">&nbsp;</font></td>
<td align="center"><font face="Arial" size="2">A String</font></td>
<td align="center"><font face="Arial" size="2"> 1 Digit always preceded by a space </font></td>
<td align="center"><font face="Arial" size="2">Letters, Digits, (commas|asterisks)?</font></td>
<td align="center"><font face="Arial" size="2">A String always including a dash</font></td>
<td align="center"><font face="Arial" size="2">&nbsp;</font></td></tr>
Now here's the problem: the 9th piece of data is on occasion the string "SEE SCHEDULE OF CLASSES", and the tenth will then be &nbsp;. The regex is the condition of an if statement:
  while (<FILE>) {
    if (/REGEX/) {    # REGEX stands for the pattern above
      # do stuff
    }
  }
The problem comes in when a file with this alternate format is input. The script hangs and then says that there was an internal server error. I managed to get around this by using a different regex to first test for the "SEE SCHEDULE OF CLASSES" string and if it exists simply going to the next line of the file.
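
The workaround described above might look roughly like the following sketch. It is an assumption about the shape of the script, not the original code: the sub name rows_to_process and the sample lines are made up, and an in-memory filehandle stands in for the real HTML file.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines that should go on to the big regex,
# skipping any row that carries the "SEE SCHEDULE OF CLASSES" placeholder.
sub rows_to_process {
    my ($fh) = @_;
    my @keep;
    while (my $line = <$fh>) {
        # Cheap literal test first, so the expensive pattern never sees these rows.
        next if $line =~ /SEE SCHEDULE OF CLASSES/;
        push @keep, $line;
    }
    return @keep;
}

# In-memory filehandle standing in for the real file (sample data is made up).
my $sample = "<tr><td>12345</td></tr>\n"
           . "<tr><td>SEE SCHEDULE OF CLASSES</td></tr>\n"
           . "<tr><td>67890</td></tr>\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my @rows = rows_to_process($fh);
close $fh;

print scalar(@rows), " rows kept\n";
```

This only sidesteps the slow case; it does not explain why the main regex never returns on those rows.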

My question is why does my regex simply not match, return false, the if not execute, and the loop continue?
Thanks in advance for any and all help.

   -Etan

Replies are listed 'Best First'.
Re: Problem with CGI script not working (regex at fault)
by tachyon (Chancellor) on Jul 29, 2001 at 16:57 UTC

    I could speculate all day, but if you add this block of code to the beginning of your script then Perl will tell you exactly what the problem is in the browser window rather than give you a 500 internal server error. Perhaps you might like to add this, then post the result if you can't answer the problem yourself with the info you get?

    # ensure all fatals go to browser during debugging and setup
    # *don't* uncomment these lines on production code for security
    BEGIN {
        $|=1;
        print "Content-type: text/html\n\n";
        use CGI::Carp('fatalsToBrowser');
    }

    As an aside the /i modifier makes the regex case insensitive which is what you are doing in a fairly obscure way with this bit:

    /^<(?:[Tt][Rr])...../
    # it would be much easier to read if you had
    /^<(?:TR).........../i
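
As a small illustration of the /i point, the sketch below runs the hand-rolled character classes and the /i form over the same inputs. The sample tags and variable names are made up for the example.

```perl
use strict;
use warnings;

# Two spellings of "the line starts with a <tr tag, in any letter case".
my $by_hand = qr/^<(?:[Tt][Rr])/;   # case handled with character classes
my $with_i  = qr/^<(?:TR)/i;        # same intent, case handled by /i

# Made-up inputs covering the case combinations plus one non-match.
my @samples = ('<tr>', '<TR align="center">', '<Tr>', '<td>');

my %result;
for my $tag (@samples) {
    my $hand = $tag =~ $by_hand ? 1 : 0;
    my $flag = $tag =~ $with_i  ? 1 : 0;
    $result{$tag} = [$hand, $flag];
    print "$tag: by hand=$hand, with /i=$flag\n";
}
```

Note that /i applies to the whole pattern, so any other literal text in the regex becomes case insensitive too; (?i:...) limits the effect to one group if that matters.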

    Also I really doubt you need/want this at the end of your regex:

    (<.*)?

    I guess the reason for suggesting using a module written for the task is that it is likely to be more reliable and robust than a regex solution.

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      First off it's quite a "Duh!" moment when someone points out something you just hadn't thought of but really should have, (i.e. the /i switch), thanks. Let's hope I remember it in the future.

      I put in the BEGIN block you suggested and got nothing, absolutely nothing printed to the browser.

      As for the (<.*)? at the end, I need that there because there are special circumstances where a complementary bit of information is stuck on the end of the line, and I put that in to slurp it up.

      Thanks for the help and advice so far, now if I could only figure out what the problem was.

         -Etan

        If nothing prints to the browser then you have a problem with your script that has nothing to do with your regex. The BEGIN block prints a valid header (as every CGI script must). It does not stop your script from printing its own header, so *at the very least* you should see some header info in the browser window: the header your script would output without the BEGIN block. If you do not, your script is not printing a valid header (nor anything else, if you see nothing in the browser window). You will need to post a link to the full script or post it here, as the problem is not the regex as you suggest.

        cheers

        tachyon

        s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

(ichimunki) Re: Problem with CGI script not working (regex at fault)
by ichimunki (Priest) on Jul 29, 2001 at 19:55 UTC
    You may not want to hear this, but you are just making your life difficult doing this as a regex-- every little permutation in the source HTML is potentially going to break your RE. Why don't you use a module like HTML::TokeParser, for which PerlMonks even has an excellent Tutorial? That way you can run through a pattern of tokens (tags), and when you get to the set of tokens you need, you can run your regex against the text itself without having to worry about all the rest of that stuff.

    I find it hard to believe that changing your input alone can cause your script to hang, unless you have built in some later use of the variables set by this regex and haven't checked that they actually get set in the cases when the regex fails to match. So maybe that is what is happening?
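
To illustrate the unchecked-variable case, here is a minimal sketch with made-up sub names and inputs. After a failed match, $1 keeps whatever the last successful match in that scope put there, so code that reads it without testing the match can silently reuse stale data.

```perl
use strict;
use warnings;

# Unsafe: reads $1 without checking whether the second match succeeded.
sub unchecked {
    my ($first, $second) = @_;
    $first =~ /(\d+)/;
    my $x = $1;
    $second =~ /(\d+)/;    # if this fails, $1 is left untouched
    my $y = $1;
    return ($x, $y);
}

# Safer: capture in list context and test the result of the match itself.
sub checked {
    my ($line) = @_;
    if (my ($digits) = $line =~ /(\d+)/) {
        return $digits;
    }
    return;    # nothing captured
}

my ($x, $y) = unchecked('room 101', 'no digits here');
print "unchecked: $x, $y\n";
print "checked: ", (defined checked('no digits here') ? 'set' : 'undef'), "\n";
```

The second value from unchecked comes from the first string, not the second, which is the kind of surprise worth ruling out before blaming the regex.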
Re: Problem with CGI script not working (regex at fault)
by tadman (Prior) on Jul 29, 2001 at 17:01 UTC
    I think you've painted yourself into a corner. Monster regexes can be constructed to parse HTML and tables, but they are quite difficult to perfect. As you have noted, HTML::TableExtract is far superior to what you have tried to create.

    The HTML::TableExtract module is surprisingly simple to use. You really should give it a shot, as it will likely take less time to figure out how to use it than it would to diagnose your problem.

    The reason it may neither match nor return false is that it's spinning, trying to find a match. Maybe there are a billion different ways to try to get a match, and it will investigate them all just to be sure. Doing a series of mini-matches is much better, and using HTML::Parser is better yet.
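
A series of mini-matches could be sketched as below: first pull the cell texts out of a row, then check each cell with its own small anchored pattern. The sub names, the sample row and the choice to validate only the first five columns are assumptions made for the illustration, based on the row format in the question.

```perl
use strict;
use warnings;

# Step 1: collect the text of every <td> cell in one row.
# [^>]* and [^<]* cannot run past the next tag boundary.
sub cells {
    my ($row) = @_;
    my @cells = $row =~ m{<td\b[^>]*>\s*(?:<font\b[^>]*>)?([^<]*)}gi;
    s/^\s+|\s+$//g for @cells;    # trim surrounding whitespace
    return @cells;
}

# Step 2: validate the leading columns one at a time.
my @checks = (qr/^\d{5}$/, qr/^\d{2}$/, qr/^\d{3}$/, qr/^\d{3}$/, qr/^\d{2}$/);

sub parse_row {
    my ($row) = @_;
    my @c = cells($row);
    return unless @c >= @checks;
    for my $i (0 .. $#checks) {
        return unless $c[$i] =~ $checks[$i];
    }
    return @c;
}

# Made-up row in the question's format, using the alternate ninth column.
my $td  = '<td align="center"><font face="Arial" size="2">';
my $row = '<tr>';
$row .= "$td$_</font></td> "
    for '12345', '12', '345', '678', '90', '&nbsp;', 'A String', ' 1 ',
        'SEE SCHEDULE OF CLASSES', '&nbsp;';
$row .= '</tr>';

my @fields = parse_row($row);
print scalar(@fields), " cells; ninth is '$fields[8]'\n";
```

With the cells separated, the "SEE SCHEDULE OF CLASSES" case becomes an ordinary string comparison on one field instead of a reason for the whole pattern to fail.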
Re: Problem with CGI script not working (regex at fault)
by dsb (Chaplain) on Jul 29, 2001 at 17:14 UTC
    "why does my regex simply not match, return false, the if not execute, and the loop continue?"--

    First of all, the code is working like that because that is what you have programmed it to do. Your regular expression is failing. When a regex fails it returns a false value, which your if evaluates, sees is not true, and so does not execute its conditional block.

    So, the tricky part is, why does your regex not match? Well, there are a couple of things. For one, you are using the '.*?' construct quite a bit. You mentioned the Death to Dot Star! node, but the construct you are using differs from .* in a very important way. The question mark makes the .* non-greedy, matching as little as it can. It looks like you wanted to use the greedy nature of .* to your advantage, but you added the question mark, changing its nature.

    Watch what happens with these two examples:

    $str = '<tr><td width="0" align="center"><font face="Arial" size="2">5 Digits </font></td>';
    if ($str =~ m/(<[Tt][Rr].*>)/) {     # using .*
        print $1, "\n";
    }

    # or

    $str = '<tr><td width="0" align="center"><font face="Arial" size="2">5 Digits </font></td>';
    if ($str =~ m/(<[Tt][Rr].*?>)/) {    # using .*?
        print $1, "\n";
    }
    The regexes in both examples will succeed, but the output will be very different: the greedy .* runs to the last > and prints the whole string, while .*? stops at the first > and prints only <tr>.

    One more thing: You are saying while (<FH>), which is all well and good, but your regex is testing the entire contents of the table. Unless that table is all on one line of the file your regex has way too much in it.

    Now, knowing what you now know about greediness and non-greediness, go back and tweak your regex.

    Amel - f.k.a. - kel

      I was perhaps unclear in what I wanted to happen. I wanted the regex to fail on the lines it did not match, then return a false value to be evaluated by the if, have the if evaluate to false and skip its contents, which would then allow the while loop to continue on to its next iteration.
      Each line of the input file does indeed consist of one (except in the special cases mentioned in my response to tachyon when it consists of two) table row(s).

      Thank you for being an ever vigilant watchdog. The perlmonks community is well served by those who keep such careful, and courteous, watch over its supplicants.

         -Etan
(tye)Re: Problem with CGI script not working (regex at fault)
by tye (Sage) on Jul 30, 2001 at 21:37 UTC

    Your problem description makes me suspicious that your regex is taking too many resources in the case you cite.

    You see, you neglected one of the possible outcomes that becomes much more likely when you use such a horrid regex. The possible outcomes of using a regex are:

    • Regex matches
    • Regex determines that it can't match
    • Regex keeps looking for match and not finding one and takes forever backtracking, consuming more resources until some limit is passed and the script is killed.

    I don't understand why you refuse to use a module designed to deal with this type of problem. But at least follow the advice given in the node you reference! I suspect that many of your uses of .*? can be replaced with [^>]* or [^<]*, for example. And something like [^>]*> never has to backtrack.
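
As a hedged sketch of that substitution, the pattern below covers only the first five columns of the row format from the question and uses [^>]* and \s* in place of every .*?. The helper names and sample rows are made up, and the real regex would need the remaining columns added in the same style.

```perl
use strict;
use warnings;

# Hypothetical helper: one table cell whose text must match $inner.
# Each quantifier stops at a tag boundary or at non-whitespace, so there is
# little for the engine to retry when a cell does not fit.
sub cell {
    my ($inner) = @_;
    return qr{<td\b[^>]*>\s*<font\b[^>]*>\s*($inner)\s*</font>\s*</td>\s*}i;
}

# First five columns from the question: 5, 2, 3, 3 and 2 digits.
my $cells = '';
$cells .= cell("\\d{$_}") for 5, 2, 3, 3, 2;
my $row_start = qr{^<tr\b[^>]*>\s*$cells}i;

# Build a made-up row in the question's format from a list of cell texts.
sub make_row {
    my @values = @_;
    my $font = '<font face="Arial" size="2">';
    my $row  = '<tr>';
    $row .= qq{<td width="0" align="center">$font$_</font></td> } for @values;
    return $row . '</tr>';
}

my @got  = make_row('12345', '12', '345', '678', '90') =~ $row_start;
my @miss = make_row('SEE SCHEDULE OF CLASSES', '12', '345', '678', '90') =~ $row_start;

print "matched: @got\n";
print "non-conforming row matched: ", (@miss ? 'yes' : 'no'), "\n";
```

A row that does not fit is rejected at the first cell that fails, rather than after every combination of .*? lengths has been tried.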

    Each time you use .* or .*?, you give the regex one more place to try backtracking. One such place may mean the regex backtracks over the whole string once. Two such places can end up with the first place backtracking over the whole string and at each point in the string, the second one could backtrack over one side of the string.

    So, on a string of length L with B spots in the regex that could require backtracking, you have a potential for run-time proportional to L**B. Such a regex could run very fast when it finds a match but take the full O(L**B) in cases when there is no match to be found.
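
The growth can be observed on a toy case with the sketch below, which counts how often the engine gets past B lazy ".*?>" pieces before a final \d fails. Everything in it is an assumption made for illustration: the test string, the sub name count_attempts, and the counter $main::steps, which is incremented from an embedded (?{ }) code block.

```perl
use strict;
use warnings;
use re 'eval';    # the pattern containing a (?{ }) block is built at run time

# Count how many times the engine reaches the end of $spots lazy ".*?>"
# pieces on a string that contains no digit, so the final \d always fails.
sub count_attempts {
    my ($gt_count, $spots) = @_;
    my $text    = 'a>' x $gt_count;    # $gt_count ">" characters, no digits
    my $pattern = '^' . ('.*?>' x $spots) . '(?{ $main::steps++ })\d';
    $main::steps = 0;
    $text =~ /$pattern/;               # never matches; we only want the count
    return $main::steps;
}

for my $spots (1 .. 3) {
    printf "%d spot(s): %d attempts\n", $spots, count_attempts(10, $spots);
}
```

With 10 ">" characters each extra spot multiplies the work: the counts follow the number of ways to pick B of the 10 positions (10, 45, 120), which for fixed B grows on the order of L**B.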

            - tye (but my friends call me "Tye")