Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a module that takes a regular expression as part of a config file. It parses a web page and tries to locate a specific link based on the date. One particular page happens to use tables, which makes this difficult. An example page might look like:

<TABLE> <TD><A HREF="/yesterday"><IMG ...></a></td> <TD>5-2-2001</td> <TD><A HREF="/today"><IMG ...></a></td> <TD>5-3-2001</td> </table>

Initially I tried to use m|A HREF="(.*?)">.*?5-3-2001|igs. This works, but not quite the way I want: no matter which date is given, it always finds the first HREF. I've read over the perlre docs and figured out why it does this, but I can't figure out how to do what I want without writing special code.

And that, of course, is the other complication. As I said above, I'm trying to use a system I wrote that reads the RE from a config file. I don't want to have to write code specifically for this one instance...I'd rather be able to handle it with the one RE in the config file. Any ideas, or do I just have to suck it up and write code to do this kind of thing?

Replies are listed 'Best First'.
Re: Complicated(?) RE help
by Anonymous Monk on May 03, 2001 at 21:49 UTC
    Of course, after posting this, I made a resounding smack of my hand on my forehead. Since a greedy .* will take as much as it can without making the match fail, I simply added .* to the beginning of the regex. The greedy .* sucks up everything it can, right up to the last HREF before the date. It doesn't take that HREF, because the regex specifies it as a separate element: if .* gobbled up the HREF before the date, the rest of the expression could no longer match, so it leaves the one I want alone. Works like a charm.
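
    A minimal sketch of the greedy-prefix fix described above. The sample page is the one from the question with the elided IMG attributes left out; the sub name href_for_date and the use of \Q...\E to quote the date are assumptions made for illustration, not part of the original post.

```perl
use strict;
use warnings;

# Sample page from the question (IMG attributes omitted).
my $html = '<TABLE> <TD><A HREF="/yesterday"><IMG></a></td> <TD>5-2-2001</td> '
         . '<TD><A HREF="/today"><IMG></a></td> <TD>5-3-2001</td> </table>';

# Hypothetical helper: the leading greedy .* consumes as much as possible,
# then backtracks to the last A HREF that is still followed by the date.
sub href_for_date {
    my ($page, $date) = @_;
    return $page =~ m|.*A HREF="(.*?)">.*?\Q$date\E|is ? $1 : undef;
}

print href_for_date($html, '5-2-2001'), "\n";   # prints /yesterday
print href_for_date($html, '5-3-2001'), "\n";   # prints /today
```

    Because the whole fix lives inside the pattern, it can stay in the config file as a single regex with no page-specific code.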
Re: Complicated(?) RE help
by Masem (Monsignor) on May 03, 2001 at 20:36 UTC
    I believe what is happening is a result of the regex engine's left-to-right behavior. The first thing the regex has to match is the A HREF part, so as soon as it sees the first link it is satisfied, and only then does it look at the .*? portion. From there it moves to the right in non-greedy fashion, and when it encounters your date string the match is complete. So it's a valid match as far as the regex engine is concerned, but obviously not the match you want.
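
    The behavior described above can be reproduced with a short script. This is only a sketch: the sample page is the one from the question with the elided IMG attributes left out, and the sub name first_href is made up for illustration.

```perl
use strict;
use warnings;

# Sample page from the question (IMG attributes omitted).
my $html = '<TABLE> <TD><A HREF="/yesterday"><IMG></a></td> <TD>5-2-2001</td> '
         . '<TD><A HREF="/today"><IMG></a></td> <TD>5-3-2001</td> </table>';

# Hypothetical helper using the original pattern: the match starts at the
# leftmost A HREF, and the non-greedy .*? then stretches until it hits the date.
sub first_href {
    my ($page, $date) = @_;
    return $page =~ m|A HREF="(.*?)">.*?\Q$date\E|is ? $1 : undef;
}

print first_href($html, '5-2-2001'), "\n";   # prints /yesterday
print first_href($html, '5-3-2001'), "\n";   # prints /yesterday as well
```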

    What you probably need to do is use japhy's sexeger (reversed regexes) to get this, since you really want to match the date first and then go right-to-left. The link has more info on how to set this up (this requires no new modules, fortunately).
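
    A rough sketch of the reversed-regex idea, under the assumption that it amounts to reversing the input, matching a pattern written backwards, and reversing the capture. The linked write-up is not reproduced here, so the details (and the sub name href_for_date_reversed) are illustrative only.

```perl
use strict;
use warnings;

# Sample page from the question (IMG attributes omitted).
my $html = '<TABLE> <TD><A HREF="/yesterday"><IMG></a></td> <TD>5-2-2001</td> '
         . '<TD><A HREF="/today"><IMG></a></td> <TD>5-3-2001</td> </table>';

# Hypothetical helper: in the reversed page the date comes first, so a plain
# left-to-right, non-greedy match finds the nearest link that precedes it.
sub href_for_date_reversed {
    my ($page, $date) = @_;
    my $rev_page = reverse $page;   # scalar context: reverses the string
    my $rev_date = reverse $date;
    # The pattern is 'A HREF="(...)">' written backwards.
    return $rev_page =~ m|\Q$rev_date\E.*?>"(.*?)"=FERH A|is
        ? scalar reverse $1
        : undef;
}

print href_for_date_reversed($html, '5-2-2001'), "\n";   # prints /yesterday
print href_for_date_reversed($html, '5-3-2001'), "\n";   # prints /today
```

    Note that this approach needs code around the regex to do the reversing, so it does not fit a setup where only the pattern comes from a config file.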

    Alternatively, if the data you want is strictly in a table, you could try one of the HTML table parsers on CPAN, but this might be more than necessary and would not extend to pages that aren't laid out as tables.


    Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain