page parsing regex

coldfingertips has asked for the wisdom of the Perl Monks concerning the following question:

I know someone will say you shouldn't use regular expressions on HTML, but if I had used the LinkExtractor I'd have ended up with links I didn't want and would have had to use regexes anyhow.

   while($google_results =~ m|<p\sclass=g><a href=(http://www\..+)\sonmousedown|gs) 
   {
      push (@links_found, $1);
   }

For some reason that's not matching with this data...

<p class=g><a href=http://www.ets.org/toefl/ onmousedown="return clk(this,'res',1)">Welcome to TOEFL: The <b>Test</b> of English as a Foreign Language</a><br><font size=-1>Information about the TOEFL tests and services are available online. Try the<br>
TOEFL practice questions.<br><font color=#008000>www.ets.org/toefl/ -  18k - </font><nobr>  <a class=fl href="http://64.233.161.104/search?q=cache:Gq-KV5uuj6YJ:www.ets.org/toefl/+test&hl=en">Cached</a> - <a class=fl href="/search?hl=en&lr=&safe=off&q=related:www.ets.org/toefl/">Similar&nbsp;pages</a></nobr></font> <blockquote class=g><p class=g><a href=http://www.ets.org/testcoll/ onmousedown="return clk(this,'res',2)">The ETS <b>Test</b> Collection includes an extensive library of 20000 <b>...</b></a><br><font size=-1>The ETS <b>Test</b> Collection includes an extensive library of 20000 tests and other<br>
measurement devices from the early 1900s to the present.&lt;/<br><font color=#008000>www.ets.org/<b>test</b>coll/ -  11k - </font><nobr>  <a class=fl href="http://64.233.161.104/search?q=cache:mY1iJUWuYoEJ:www.ets.org/testcoll/+test&hl=en">Cached</a> - <a class=fl href="/search?hl=en&lr=&safe=off&q=related:www.ets.org/testcoll/">Similar&nbsp;pages</a></nobr></font> </blockquote><p class=g><a href=http://www.test.com/ onmousedown="return clk(this,'res',3)">

As you can see I am going through a lot of junk just to get all the URLs from the search page. Can someone help tweak my regex a bit?


Replies are listed 'Best First'.
Re: page parsing regex by gellyfish (Monsignor) on May 12, 2005 at 08:57 UTC
Pah! Of course you can do it with a proper HTML parser and not have to use regexes:

   #!/usr/bin/perl

   use strict;
   use warnings;

   use HTML::Parser;

   my $parser = HTML::Parser->new(
       start_h          => [ \&start, "self,tag,attr" ],
       start_document_h => [ \&init,  "self" ]
   );

   $parser->parse_file('ert.html');

   foreach my $link ( @{ $parser->{_links} } ) {
       print $link, "\n";
   }

   sub init {
       my ($self) = @_;
       $self->{_links} = [];
   }

   sub start {
       my ( $self, $tag, $attribs ) = @_;
       if ( $tag eq 'p' && exists $attribs->{class} && $attribs->{class} eq 'g' ) {
           $self->handler( start => \&get_href, "self,tag,attr" );
       }
   }

   sub get_href {
       my ( $self, $tag, $attribs ) = @_;
       if ( $tag eq 'a' && exists $attribs->{href} && exists $attribs->{onmousedown} ) {
           push @{ $self->{_links} }, $attribs->{href};
       }
       $self->handler( start => \&start, "self,tag,attr" );
   }

/J\
Re: page parsing regex by tlm (Prior) on May 12, 2005 at 08:12 UTC
The problem is that the maximal ("greedy") match consistent with the `+` in your regexp extends all the way to the last `mousedown` in the string. Just put a `?` after the `+` to force a minimal instead of the (default) maximal match.

BTW, I don't see how the use of regexes to weed out unwanted URLs argues in favor of using them to parse HTML. The latter is a much harder problem, due to the presence of nested balanced delimiters. So much harder, in fact, that "true" regular expressions (as opposed to Perl's regexes-on-steroids) cannot solve it. I think you are far better off with HTML::LinkExtor or HTML::TokeParser to extract the links. Here's how you'd do it with HTML::TokeParser:

   use HTML::TokeParser;

   my $parser = HTML::TokeParser->new( \$google_results );
   my @links_found;
   # get_tag returns [ $tag, \%attr, ... ]; element 1 holds the attributes
   while ( my $token = $parser->get_tag( 'a' ) ) {
       my $url = $token->[ 1 ]{ href };
       # skip anchors without an href, and the relative "Similar pages" links
       next unless defined $url && $url =~ m{^http://www\.};
       push @links_found, $url;
   }
   print "$_\n" for @links_found;
   __END__
   http://www.ets.org/toefl/
   http://www.ets.org/testcoll/
   http://www.test.com/

the lowliest monk
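To see the greedy versus minimal behaviour in isolation, here is a minimal sketch. The `$html` string and its `example` hosts are made up for illustration (they are not from the original data); only the two patterns mirror the ones discussed in this thread.

```perl
use strict;
use warnings;

# Hypothetical, shortened sample in the shape of the results page above.
my $html = '<p class=g><a href=http://www.a.example/ onmousedown="x">A</a>'
         . '<p class=g><a href=http://www.b.example/ onmousedown="y">B</a>';

# Greedy: .+ runs to the end of the string, then backtracks only as far
# as the LAST " onmousedown", so everything collapses into one capture.
my @greedy = $html =~ m{<p\sclass=g><a href=(http://www\..+)\sonmousedown}gs;

# Minimal: .+? stops at the FIRST " onmousedown" after each href.
my @lazy = $html =~ m{<p\sclass=g><a href=(http://www\..+?)\sonmousedown}gs;

print scalar(@greedy), " greedy match(es)\n";
print "$_\n" for @lazy;
```

With the greedy pattern there is a single capture that swallows both links; the minimal pattern yields one capture per link.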
Re^2: page parsing regex by Anonymous Monk on May 12, 2005 at 08:34 UTC
Hi. I added the ? and made the code easier but it's still not giving way.

   while($google_results =~ m|<p class=g><a href=(http://.+?)\sonmousedown|gs)
Re^3: page parsing regex by tlm (Prior) on May 12, 2005 at 09:08 UTC
It works perfectly for me:

   while ( $google_results =~ m|<p class=g><a href=(http://.+?)\sonmousedown|gs) {
       push @links_found, $1;
   }
   print "$_\n" for @links_found;
   __END__
   http://www.ets.org/toefl/
   http://www.ets.org/testcoll/
   http://www.test.com/

BTW, it's a bit perverse to pick a character like `|`, which has a special meaning within regexps, as your delimiter for `m`.

the lowliest monk
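On the delimiter point, a small sketch of the same loop written with `m{...}`, which leaves `|` free to act as alternation inside the pattern. The `$page` sample and the `https` link are invented here purely to exercise the alternation; they are not part of the original data.

```perl
use strict;
use warnings;

# Hypothetical sample; the https link is an assumption added for illustration.
my $page = '<p class=g><a href=http://www.a.example/ onmousedown="x">A</a>'
         . '<p class=g><a href=https://www.b.example/ onmousedown="y">B</a>';

my @urls;
# With braces as the delimiter, | inside the pattern is plain alternation.
while ( $page =~ m{<p\sclass=g><a\shref=((?:http|https)://.+?)\sonmousedown}gs ) {
    push @urls, $1;
}
print "$_\n" for @urls;
```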
Re: page parsing regex by Animator (Hermit) on May 12, 2005 at 08:28 UTC
Have you considered using DBD::Google? (I'm not sure it does what you want; I've never used it, I only know it exists.) I'm guessing from the name of the variable and the output you showed that you are parsing a Google page. Or is the purpose to use it on other pages as well?