Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I've tried a few different scripts in the past to parse a Google search results page, but I could never get any of them to work. Does anyone know why this might be? I can parse other search engines' pages, but Google NEVER works.

Replies are listed 'Best First'.
Re: LWP to Google
by CountZero (Bishop) on Mar 29, 2004 at 19:12 UTC
    There is of course a Google API and, even better, the Google modules on CPAN.

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: LWP to Google
by pzbagel (Chaplain) on Mar 29, 2004 at 19:21 UTC

    I second CountZero's recommendation. If you are stumped by the Google API, the O'Reilly title Google Hacks can ease you into using it, as well as web services in general. In addition, Google's AUP is pretty clear that they prohibit automated scanning and scraping of their results unless you use the Google API to do it. I understand their reasoning and think it is a fair policy. Please sign up for a Google Developer account and use their API.

    Word of warning: I believe Google blocks by IP if they catch you doing automated scans (without using the Google API). I don't know how long the block lasts, but if you share your internet connection via NAT, there's a chance you'll punish everyone in the office with no access to Google for a while. I know I'd be one to show up with pitchforks and torches at your cube if that happened.

    Later

Re: LWP to Google
by fxmakers (Friar) on Mar 29, 2004 at 21:32 UTC
    If you don't want to use the Google API, but parse the web page results, here's my code:
    use strict;
    use IO::Socket::INET;

    my $limit = 5;    # max number of results to print

    # Treat all command-line arguments as one search phrase.
    google_search(join(' ', @ARGV));

    sub google_search {
        my $keyword = shift;
        if (!$keyword) {
            die("no keywords\n");
        }

        my $socket = IO::Socket::INET->new(
            Proto    => "tcp",
            PeerAddr => "www.google.com",
            PeerPort => 80,
            Timeout  => 3
        );
        if (!$socket) {
            die("error connecting to the server\n");
        }
        $socket->autoflush(1);

        # Spaces become "+" in the query string.
        my $query = $keyword;
        $query =~ tr/ /+/;

        my $desc  = "";
        my $link  = "";
        my $junk  = "";
        my $idx   = 0;
        my $nodoc = 0;

        print $socket "GET /search?hl=en&ie=ISO-8859-1&q=$query HTTP/1.1\r\n";
        print $socket "Host: www.google.com\r\n";
        print $socket "User-Agent: Mozilla/5.0\r\n";
        print $socket "Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*\r\n";
        print $socket "Accept-Language: en-us,en;q=0.5\r\n";
        # "close" rather than "Keep-Alive", so the read loop below ends at EOF
        # instead of blocking on an open connection.
        print $socket "Connection: close\r\n";
        print $socket "\r\n";

        while (my $buffer = <$socket>) {
            # Normalize whitespace and drop the bold tags around matched keywords.
            $buffer =~ s/\s+$//;
            $buffer =~ s/^\s+//;
            $buffer =~ tr/ //s;
            $buffer =~ s/<b>//g;
            $buffer =~ s/<\/b>//g;

            # \Q...\E keeps regex metacharacters in the keyword from being interpreted.
            if (!$idx && ($buffer =~ /^<br><br>Your search - \Q$keyword\E - did not match any documents./)) {
                print STDOUT "no doc found, sorry\n";
                $nodoc = 1;
                last;
            }
            else {
                if (!$desc) {
                    ($junk, $desc) = $buffer =~ /(<\/blockquote>|<div>|<\/a><\/font> )<p class=g><a href=\S+>(.*?)<\/a>(<br>)?<font size=-1>([^<]| \- \[ | \.\.\.|<i>|<span class=f>)/;
                    $desc =~ s/&amp;/&/g;
                    $desc =~ s/&quot;/"/g;
                }
                if (!$link) {
                    ($junk, $link) = $buffer =~ /(<\/blockquote>|<div>|<\/a><\/font> )<p class=g><a href=(\S+)>(.*?)<\/a>(<br>)?<font size=-1>([^<]| \- \[ | \.\.\.|<i>|<span class=f>)/;
                }
                if ($desc && $link) {
                    if (++$idx > $limit) {
                        last;
                    }
                    print STDOUT "$idx) $desc\n";
                    print STDOUT " $link\n";
                    $desc = "";
                    $link = "";
                }
            }
        }
        close($socket);

        if (!$idx && !$nodoc) {
            print STDOUT "no doc found, sorry\n";
        }
    }

    1;


    Google's output markup varies depending on the results, so you have to combine several regex alternatives to match it.
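
    One way to keep those alternatives manageable is to hold them in a list and try each in turn, rather than packing everything into a single expression. The following is only a rough sketch of that idea: the second pattern, the extract_result helper, and the sample lines are all made up for illustration and are not real Google output.

```perl
use strict;
use warnings;

# Hypothetical patterns: each one targets a different (assumed) result layout.
# Every pattern must capture the link in $1 and the title in $2.
my @patterns = (
    qr{<p class=g><a href=(\S+?)>(.*?)</a>},          # layout like the one matched above
    qr{<h3 class="r"><a href="([^"]+)">(.*?)</a>},    # assumed alternative layout
);

# Try each pattern in turn on one line of HTML.
# Returns (link, title) on the first match, or an empty list if none match.
sub extract_result {
    my ($line) = @_;
    for my $re (@patterns) {
        if ($line =~ $re) {
            my ($link, $title) = ($1, $2);
            $title =~ s/<[^>]+>//g;    # drop inline tags such as <b>
            $title =~ s/&quot;/"/g;
            $title =~ s/&amp;/&/g;     # decode &amp; last to avoid double-decoding
            return ($link, $title);
        }
    }
    return;
}

# Made-up sample lines, not real Google output.
my @sample = (
    '<div><p class=g><a href=http://www.perl.org/>The <b>Perl</b> Directory</a><br>',
    '<h3 class="r"><a href="http://www.cpan.org/">CPAN &amp; friends</a></h3>',
    '<p>nothing to see here</p>',
);

for my $line (@sample) {
    my ($link, $title) = extract_result($line);
    print "$title\n   $link\n" if defined $link;
}
```

    When the page layout changes, only the pattern list needs a new entry; the loop that prints the results stays the same.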

    Run it using: perl file.pl your keywords here
    Hope this helps.
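
    A note on keywords: characters such as & or + would be read as part of the URL if they went into the GET line unescaped. Here is a minimal sketch of percent-encoding the keywords first, assuming they are plain single-byte strings; build_query and request_path are hypothetical helper names, not part of the script above.

```perl
use strict;
use warnings;

# Build the q= value from a list of keywords: percent-encode each keyword,
# then join the keywords with "+" (the form encoding for a space).
# Assumes each keyword is a string of single-byte characters.
sub build_query {
    my @keywords = @_;
    my @encoded;
    for my $word (@keywords) {
        my $copy = $word;
        # Encode everything that is not an RFC 3986 unreserved character.
        $copy =~ s/([^A-Za-z0-9\-._~])/sprintf("%%%02X", ord($1))/ge;
        push @encoded, $copy;
    }
    return join '+', @encoded;
}

# Hypothetical helper: the parameters mirror the GET line in the script above.
sub request_path {
    return '/search?hl=en&ie=ISO-8859-1&q=' . build_query(@_);
}

# Use the command-line keywords, or a made-up example if none are given.
print request_path(@ARGV ? @ARGV : ('perl', 'c++ & regex')), "\n";
```

    Spaces inside a single quoted argument come out as %20, while separate arguments are joined with +.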


    P.S.: Google will change its web design soon, so this code may not work with the new layout; I'll have to test it.