Multiple line regex match

dogen has asked for the wisdom of the Perl Monks concerning the following question:

Hello. I'm using a script that auto-fills an HTML form and parses the result looking for a match. If all is good I should get the following response:

...
        <tr>
         <td class="order2" width="266"><p>Transaction Number:</p>
      </td>
         <td class="order3" width="324">V64F66697601</td>
      </tr>
...

What I need is to find the Transaction # and grab it to save to the record. What I have been trying (well, the latest version anyway) is:

...
my $order_reg = "<p>Transaction Number:<\/p>\n.*<\/td>\n.*<td .*>(.*)<\/td>";

if ($request->content =~ /$order_reg/s) {
  print "\nmatched $1\n";
}

What happens is that it comes back as matching but it doesn't grab $1; it's just blank. It looks like the first ".*" matches everything to the end of the HTML response. Does anyone have a suggestion as to how the regex should be set up?

D


Replies are listed 'Best First'.
Re: Multiple line regex match by edan (Curate) on Nov 29, 2004 at 15:12 UTC
The textbook answer will be "Don't use a regex to parse HTML, use a parser!" (such as HTML::Parser or HTML::TokeParser). If you're really bent on doing it the quick and dirty (and wrong) way, you might want to look at turning your `.*`'s into `.*?`, and read up on greediness in perlre to see why. -- edan
Re^2: Multiple line regex match by dogen (Acolyte) on Nov 29, 2004 at 15:33 UTC
Thanks, the `.*?` worked, as follows (used with the `s` option at the end of the regex): `my $order_reg = "<p>Transaction Number:<\/p>.*?<\/td>.*?<td .*?>(.*?)<\/td>";`
Re: Multiple line regex match by gaal (Parson) on Nov 29, 2004 at 15:17 UTC
If the match succeeded, then the first `.*` could not have matched everything to the end of the data. But the second `.*` may have been greedier than you like; do you have this somewhere? `<td [something]></td>` (That is, an empty TD element.) Bottom line: either try `.*?` to make your matches non-greedy, or (as I'm sure millions of people will have told you before I can press "submit") use an HTML parser and not regular expressions to parse your data. :)
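To make the greediness point concrete, here is a minimal sketch that runs the original pattern and the non-greedy variant against the same input. The second table row (with its empty `<td>`) is an assumption added for illustration; the real response was not posted in full.

```perl
use strict;
use warnings;

# Hypothetical response: the fragment from the question, followed by a
# made-up later row whose value cell is empty.
my $html = <<'HTML';
<tr>
 <td class="order2" width="266"><p>Transaction Number:</p>
</td>
 <td class="order3" width="324">V64F66697601</td>
</tr>
<tr>
 <td class="order2" width="266"><p>Notes:</p>
</td>
 <td class="order3" width="324"></td>
</tr>
HTML

# Greedy: each .* first runs to the end of the input and then backtracks,
# so the match settles on the LAST <td ...>...</td> that can satisfy it.
my $greedy = qr{<p>Transaction Number:</p>\n.*</td>\n.*<td .*>(.*)</td>}s;

# Non-greedy: each .*? stops at the first place the rest can match.
my $lazy = qr{<p>Transaction Number:</p>.*?</td>.*?<td .*?>(.*?)</td>}s;

my ($greedy_capture) = $html =~ $greedy;
my ($lazy_capture)   = $html =~ $lazy;

print "greedy: '$greedy_capture'\n";
print "lazy:   '$lazy_capture'\n";
```

With this input the greedy pattern still matches, but its capture comes from the empty cell in the later row, while the non-greedy pattern captures `V64F66697601`.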
Re: Multiple line regex match by Fletch (Bishop) on Nov 29, 2004 at 15:15 UTC
You're most likely getting bitten by greediness and/or the fact that trying to parse HTML with just regexen is bound to end in pain and tears. Use HTML::TreeBuilder or HTML::TokeParser (or one of the derivatives) instead.
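For comparison, the parser route suggested in several replies might look roughly like the sketch below. It assumes HTML::TokeParser (from the HTML-Parser distribution on CPAN) is installed and that the value always sits in the `<td>` right after the label cell. The helper name `transaction_number` and the surrounding `<table>` wrapper are made up for this example.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample input: the fragment from the question, wrapped in a <table>
# (the wrapper is an assumption; the full page was not posted).
my $html = <<'HTML';
<table>
        <tr>
         <td class="order2" width="266"><p>Transaction Number:</p>
      </td>
         <td class="order3" width="324">V64F66697601</td>
      </tr>
</table>
HTML

# Hypothetical helper: return the text of the cell that follows the
# "Transaction Number:" label cell, or nothing if it is not found.
sub transaction_number {
    my ($doc) = @_;
    my $p = HTML::TokeParser->new(\$doc)
        or die "could not create parser";

    while ($p->get_tag('td')) {
        # Text up to the closing </td>, with whitespace collapsed.
        my $label = $p->get_trimmed_text('/td');
        next unless $label eq 'Transaction Number:';

        # The next <td> is assumed to hold the value.
        $p->get_tag('td') or last;
        return $p->get_trimmed_text('/td');
    }
    return;
}

my $txn = transaction_number($html);
print "matched $txn\n" if defined $txn;
```

Because the parser works on tokens rather than raw characters, changes in whitespace, attribute order, or extra attributes on the `<td>` tags should not affect the result.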
Re: Multiple line regex match by ikegami (Patriarch) on Nov 29, 2004 at 16:59 UTC
Instead of `.*?`, you can also use `[^<]*`. It might even be more efficient (due to less backtracking), but I'm not sure about that.

if ($request->content =~ m!
   <tr> \s*
      <td[^>]*> \s*
         <p>Transaction\ Number:</p> \s*
      </td> \s*
      <td[^>]*> \s*
         ([^<]+) \s*
      </td> \s*
   </tr>
!xs) {
  print "\nmatched $1\n";
}