in reply to page parsing regex

The problem is that the maximal ("greedy") match consistent with the + in your regexp extends all the way to the last mousedown in the string. Just put a ? after the +, to force a minimal instead of the (default) maximal match.
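To see the difference, here is a standalone sketch on a made-up one-line sample (the string below is illustrative, not the OP's actual Google output):

```perl
# Hypothetical line loosely modeled on the OP's data.
my $html = '<a href=http://www.test.com/ onmousedown=f() onmousedown=g()>';

# Greedy .+ : matches as much as possible, so it runs out to the
# LAST " onmousedown" that still lets the pattern succeed.
my ($greedy) = $html =~ m{href=(http://.+)\sonmousedown};

# Minimal .+? : stops at the FIRST point where the rest can match.
my ($minimal) = $html =~ m{href=(http://.+?)\sonmousedown};

print "greedy:  $greedy\n";   # greedy:  http://www.test.com/ onmousedown=f()
print "minimal: $minimal\n";  # minimal: http://www.test.com/
```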

BTW, I don't see how the use of regexes to weed out unwanted URLs argues in favor of using them to parse HTML. The latter is a much harder problem, due to the presence of nested balanced delimiters. So much harder, in fact, that "true" regular expressions (as opposed to Perl's regexes-on-steroids) cannot solve it.

I think you are far better off with HTML::LinkExtor or HTML::TokeParser to extract the links. Here's how you'd do it with HTML::TokeParser:

my $parser = HTML::TokeParser->new( \$google_results );
my @links_found;
while ( my $token = $parser->get_tag( 'a' ) ) {
    my $url = $token->[ 1 ]{ href };
    next unless $url =~ m{^http://www\.};
    push @links_found, $url;
}
print "$_\n" for @links_found;

__END__
http://www.ets.org/toefl/
http://www.ets.org/testcoll/
http://www.test.com/
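For comparison, a sketch of the same extraction with HTML::LinkExtor, the other module mentioned above (the sample page below is my own stand-in for $google_results, not real Google output):

```perl
use HTML::LinkExtor;

# Illustrative stand-in for the fetched results page.
my $google_results = <<'HTML';
<p class=g><a href=http://www.ets.org/toefl/ onmousedown=x>TOEFL</a>
<p class=g><a href=http://www.test.com/ onmousedown=y>Test</a>
HTML

my @links_found;
# The callback gets the tag name and its link attributes only.
my $extor = HTML::LinkExtor->new( sub {
    my ( $tag, %attr ) = @_;
    push @links_found, $attr{ href }
        if $tag eq 'a' and ( $attr{ href } // '' ) =~ m{^http://www\.};
} );
$extor->parse( $google_results );
$extor->eof;

print "$_\n" for @links_found;
```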

the lowliest monk

Re^2: page parsing regex
by Anonymous Monk on May 12, 2005 at 08:34 UTC
    Hi. I added the ? and simplified the code, but it's still not working.

while($google_results =~ m|<p class=g><a href=(http://.+?)\sonmousedown|gs)

      It works perfectly for me:

      my @links_found;
      while ( $google_results =~ m|<p class=g><a href=(http://.+?)\sonmousedown|gs ) {
          push @links_found, $1;
      }
      print "$_\n" for @links_found;

      __END__
      http://www.ets.org/toefl/
      http://www.ets.org/testcoll/
      http://www.test.com/

      BTW, it's a bit perverse to pick a character like |, which has a special meaning within regexps, as your delimiter for m.
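      For instance, braces pair up as delimiters and leave both | and / free to mean what they usually mean inside the pattern (a minimal standalone sketch with a made-up input):

```perl
# Same match as above, with braces as delimiters; nothing in the
# pattern needs escaping on account of the delimiter choice.
my $google_results = '<p class=g><a href=http://www.test.com/ onmousedown=x>';
my @links_found;
while ( $google_results =~ m{<p class=g><a href=(http://.+?)\sonmousedown}gs ) {
    push @links_found, $1;
}
print "$_\n" for @links_found;   # http://www.test.com/
```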

      the lowliest monk