dogen has asked for the wisdom of the Perl Monks concerning the following question:

Hello. I'm using a script that auto-fills an HTML form and parses the result looking for a match. If all is good I should get the following response:
... <tr> <td class="order2" width="266"><p>Transaction Number:</p> </td> <td class="order3" width="324">V64F66697601</td> </tr> ...
What I need is to find the Transaction # and grab it to save to the record. What I have been trying (well, the latest version anyway) is:
...
my $order_reg = "<p>Transaction Number:<\/p>\n.*<\/td>\n.*<td .*>(.*)< +\/td>";
if ($request->content =~ /$order_reg/s) {
    print "\nmatched $1\n";
}
What happens is that it comes back as matching but it doesn't grab $1; it's just blank. It looks like the first ".*" matches everything to the end of the HTML response. Does anyone have a suggestion as to how the regex should be set up?

D

Replies are listed 'Best First'.
Re: Multiple line regex match
by edan (Curate) on Nov 29, 2004 at 15:12 UTC

    The textbook answer will be "Don't use a Regex to parse HTML, use a parser!" (such as HTML::Parser, HTML::TokeParser).

    If you're really bent on doing it the quick and dirty (and wrong) way, you might want to look at turning your .*'s into .*?, and read up on greediness in perlre to see why.

    --
    edan

      Thanks, the .*? worked, as follows (used with the s option at the end of the regex):
      my $order_reg = "<p>Transaction Number:<\/p>.*?<\/td>.*?<td .*?>(.*?)< +\/td>";
Re: Multiple line regex match
by gaal (Parson) on Nov 29, 2004 at 15:17 UTC
    If the match succeeded, then the first .* could not have matched everything to the end of the data.

    But the second .* may have been greedier than you like; do you have this somewhere?

    <td [something]></td>

    (That is, an empty TD element.)

    Bottom line: either try .*? to make your matches not greedy, or (as I'm sure millions of people will have told you before I can press "submit") use an HTML parser and not regular expressions to parse your data. :)
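
A minimal sketch of the greedy versus non-greedy behaviour discussed above. The sample response is hypothetical: it assumes the wanted row is followed by a row whose value cell is empty, and it uses a plain `</td>` rather than the `< +\/td>` from the original pattern.

```perl
use strict;
use warnings;

# Hypothetical response: the wanted row is followed by a row whose
# value cell is empty (an empty TD element).
my $html = <<'HTML';
<tr>
<td class="order2" width="266"><p>Transaction Number:</p> </td>
<td class="order3" width="324">V64F66697601</td>
</tr>
<tr>
<td class="order2" width="266"><p>Status:</p> </td>
<td class="order3" width="324"></td>
</tr>
HTML

# Greedy: each .* grabs as much as it can, so the capture ends up being
# the content of the LAST <td ...>...</td> pair, which is empty here.
my ($greedy) = $html =~ m{<p>Transaction Number:</p>.*</td>.*<td .*>(.*)</td>}s;

# Non-greedy: each .*? stops at the first place the rest can match.
my ($lazy) = $html =~ m{<p>Transaction Number:</p>.*?</td>.*?<td .*?>(.*?)</td>}s;

print "greedy: '$greedy'\n";   # greedy: ''
print "lazy:   '$lazy'\n";     # lazy:   'V64F66697601'
```

Both patterns match, which fits the "matches but $1 is blank" symptom: the greedy version succeeds, just on the wrong cell.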

Re: Multiple line regex match
by Fletch (Bishop) on Nov 29, 2004 at 15:15 UTC

    You're most likely getting bitten by greediness and/or the fact that trying to parse HTML with just regexen is bound to end in pain and tears. Use HTML::TreeBuilder or HTML::TokeParser (or one of the derivatives) instead.
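
For comparison, a rough sketch of the parser route using HTML::TokeParser. The helper name `transaction_number` and the sample markup are made up for illustration, and it assumes the value always sits in the `<td>` right after the label cell.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical fragment modelled on the response in the question.
my $html = <<'HTML';
<table>
<tr>
<td class="order2" width="266"><p>Transaction Number:</p> </td>
<td class="order3" width="324">V64F66697601</td>
</tr>
</table>
HTML

# Hypothetical helper: returns the text of the cell that follows the
# "Transaction Number:" label cell, or nothing if it is not found.
sub transaction_number {
    my ($doc) = @_;
    my $p = HTML::TokeParser->new(\$doc) or die "cannot create parser";

    # Walk the <td> start tags one by one.
    while ($p->get_tag('td')) {
        # Text up to the closing </td>, whitespace-trimmed; the <p> is skipped.
        my $label = $p->get_trimmed_text('/td');
        next unless $label eq 'Transaction Number:';

        # The next <td> is assumed to hold the value.
        $p->get_tag('td') or last;
        return $p->get_trimmed_text('/td');
    }
    return;
}

my $id = transaction_number($html);
print "matched $id\n" if defined $id;
```

In the original script the argument would be `$request->content`. Attribute order, extra whitespace and line breaks in the markup no longer matter with this approach.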

Re: Multiple line regex match
by ikegami (Patriarch) on Nov 29, 2004 at 16:59 UTC

    Instead of .*?, you can also use [^<]*. It might even be more efficient (due to less backtracking), but I'm not sure about that.

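    # Under /x, literal whitespace in the pattern is ignored,
    # so the space in the label has to be escaped.
    if ($request->content =~ m!
            <tr> \s*
            <td[^>]*> \s* <p>Transaction\ Number:</p> \s* </td> \s*
            <td[^>]*> \s* ([^<]+) \s* </td> \s*
            </tr>
        !xs) {
        print "\nmatched $1\n";
    }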
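
One caveat with the negated class, sketched below on a made-up cell (`$cell` is hypothetical): `[^<]+` also matches whitespace, so any whitespace before the closing tag ends up in the capture. Excluding `\s` from the class avoids that, assuming the value itself never contains whitespace.

```perl
use strict;
use warnings;

# Hypothetical cell with whitespace around the value.
my $cell = qq{<td class="order3" width="324">\n  V64F66697601  \n</td>};

# [^<]+ runs up to the next "<", trailing whitespace included.
my ($raw) = $cell =~ m{<td[^>]*>\s*([^<]+)\s*</td>};

# Excluding whitespace from the class leaves it for the \s* to consume.
my ($id) = $cell =~ m{<td[^>]*>\s*([^<\s]+)\s*</td>};

print "raw: [$raw]\n";   # still carries the trailing spaces and newline
print "id:  [$id]\n";    # id:  [V64F66697601]
```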