Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Something I have tried a number of times before is parsing Google.com with LWP::Simple. And as you all probably already know, something as simple as
my $content = get("www.google.com");
doesn't quite work. Does anyone know a workaround to this using LWP::Simple where I can actually pull back search results?

Replies are listed 'Best First'.
Re: LWP and Google
by cLive ;-) (Prior) on Nov 08, 2004 at 07:22 UTC

    LWP::Simple's get() will suffice as a simple interface. But if you're using this non-commercially, you might want to consider the Google API - it's a lot more interesting to play with and quite powerful. If I remember correctly, you're limited to 1,000 searches a day. I wrote something with it a while back and loved it.

    .02

    cLive ;-)

      You can also use the API via the Net::Google module. I haven't used it myself, but it looks quite functional.
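
      For instance, here is a minimal sketch based on Net::Google's documented synopsis; the key is a placeholder you'd replace with your own from Google, and the accessor names are taken from the module's docs rather than from my own testing:

      use strict;
      use warnings;
      use Net::Google;

      # Placeholder: register with Google to obtain a real API key
      use constant LOCAL_GOOGLE_KEY => "your-google-api-key";

      my $google = Net::Google->new( key => LOCAL_GOOGLE_KEY );
      my $search = $google->search();
      $search->query(qw(langley public library));
      $search->max_results(10);

      # results() returns an array ref of result objects with
      # title() and URL() accessors
      print $_->title(), "\n\t", $_->URL(), "\n" for @{ $search->results() };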

Re: LWP and Google
by pg (Canon) on Nov 08, 2004 at 07:18 UTC

    I tried searching for "langley public library", and the URL was http://www.google.ca/search?hl=en&q=langley+public+library&meta= - that's the pattern you need.
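
    If you'd rather build that URL programmatically than paste it by hand, the URI module (which LWP already depends on) handles the escaping for you; a minimal sketch, mirroring the query parameters in the URL above:

    use strict;
    use warnings;
    use URI;

    my $u = URI->new('http://www.google.ca/search');
    # query_form() escapes the values and joins them with '&'
    $u->query_form( hl => 'en', q => 'langley public library', meta => '' );
    print "$u\n";   # http://www.google.ca/search?hl=en&q=langley+public+library&meta=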

    Update:

    sulfericacid is absolutely right. Thanks for pointing out my mistake! Google does not like LWP::UserAgent's default agent string either (see update 2 for more); it obviously checks for bots. This works:

    use strict;
    use warnings;
    use IO::Socket::INET;

    my $s = IO::Socket::INET->new(
        Proto    => "tcp",
        PeerAddr => "www.google.ca",
        PeerPort => 80,
    );
    my $url = "GET /search?hl=en&q=langley+public+library&meta= HTTP/1.1\r\n"
            . "Host: www.google.ca\r\n\r\n";
    print $s $url;
    while (my $l = <$s>) {
        print $l;
        last if ($l =~ /<\/html>/);
    }

    Update 2 ;-) Actually LWP::UserAgent also works with a little trick:

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new();
    $ua->agent("");
    my $url = "http://www.google.ca/search?hl=en&q=langley+public+library&meta=";
    print $ua->get($url)->content();
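
    If an empty agent string ever stops working, a common variant of the same trick is to send a browser-like string instead; the string below is just an example, not anything Google specifically requires:

    my $ua = LWP::UserAgent->new( agent => 'Mozilla/5.0' );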
      I've tried to do this a few times before as well (never actually found a solution). Even if you have the search pattern, you can't extract the contents of that page.

      I think what you need to do is set up your own bot/client in order for Google to allow you access.

      If you try the following code, you'll see you can't get back the contents.

      #!/usr/bin/perl

      use warnings;
      use strict;
      use LWP::Simple;

      my $source = get("http://www.google.ca/search?hl=en&q=langley+public+library&meta=");
      print $source;


      "Age is nothing more than an inaccurate number bestowed upon us at birth as just another means for others to judge and classify us"

      sulfericacid
Re: LWP and Google
by Cody Pendant (Prior) on Nov 08, 2004 at 10:20 UTC
    This works just fine:
    #!/usr/bin/perl

    use strict;
    use WWW::Mechanize;

    my $browser = WWW::Mechanize->new();
    $browser->get('http://www.google.com');
    $browser->form_name('f');
    $browser->field('q', 'langley public library');
    $browser->submit();
    print $browser->content();
    I've really come to rely on WWW::Mechanize lately. It's great.


    ($_='kkvvttuubbooppuuiiffssqqffssmmiibbddllffss')
    =~y~b-v~a-z~s; print
      I prefer to use the submit_form() method. Also, you need to check for errors, either manually or by setting autocheck.
      #!/usr/bin/perl

      use strict;
      use WWW::Mechanize;

      my $browser = WWW::Mechanize->new( autocheck => 1 );
      $browser->get('http://www.google.com');
      $browser->submit_form(
          form_name => 'f',
          fields    => { q => 'langley public library' },
      );
      print $browser->content();

      xoxo,
      Andy

Re: LWP and Google
by petdance (Parson) on Nov 08, 2004 at 17:23 UTC
    LWP::UserAgent doesn't parse content. WWW::Mechanize does.
    $ cat ./langley
    #!/usr/bin/perl

    use strict;
    use WWW::Mechanize;

    my $browser = WWW::Mechanize->new( autocheck => 1 );
    $browser->get('http://www.google.com');
    $browser->submit_form(
        form_name => 'f',
        fields    => { q => 'langley public library' },
    );
    my @links = $browser->links();
    for my $link ( @links ) {
        my $abs = $link->url_abs;
        next if $abs =~ m[^http://.+google\.com/];      # Google links
        next if $abs =~ m[^http://\Q64.233.167.104/];   # Cache
        print $link->text, "\n\t", $link->url, "\n";
    }
    $ ./langley
    Fraser Valley Regional Library
            http://www.fvrl.bc.ca/comm_branch_langleycity.asp
    Sno-Isle Libraries
            http://www.sno-isle.org/
    Public Visitor's Page for NASA Langley Technical Library
            http://library.larc.nasa.gov/Public/
    Notice for NASA Langley Employees visiting the Technical Library ...
            http://library.larc.nasa.gov/Public/nasalangley.htm
    Canadian library Web sites and catalogues by region: British ...
            http://www.collectionscanada.ca/gateway/s22-221-e.html
    LANGLEY PUBLIC LIBRARY in LANGLEY, Oklahoma Library Data / Profile
            http://www.librarybug.org/library-OK0111.html
    1st Services Squadron - Langley Air Force Base, Virginia
            http://www.langley.af.mil/1msg/1svs/Library.shtml
    Public Libraries, Oklahoma (Books)
            http://www.ohwy.com/ok/l/library.htm
    Library - Langley High School
            http://www.fcps.k12.va.us/LangleyHS/library/
    Kings Langley Public School Library
            http://members.ozemail.com.au/~stewil/fivew.html
    All the other caveats about Google not liking scraping still apply. Please also take a look at Spidering Hacks by Kevin Hemenway and Tara Calishain.

    xoxo,
    Andy

Re: LWP and Google
by ikegami (Patriarch) on Nov 08, 2004 at 07:23 UTC
    That's not a valid URI. You must specify the protocol (http://).
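
    With the scheme added, the call is at least a valid request (whether Google serves content to LWP's default agent string is a separate issue, as discussed above):

    use LWP::Simple;
    my $content = get("http://www.google.com/");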
Re: LWP and Google
by CountZero (Bishop) on Nov 08, 2004 at 17:18 UTC
    I remember this matter being discussed in a previous post, where someone pointed out that Google absolutely hates bots on its site and will even go as far as blocking the IP address of recalcitrant bots. If you have a dynamic IP address, this could really upset your provider.

    The preferred way to query them automatically is through their published API.

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law