Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Something I have tried a number of times before is parsing Google.com with LWP::Simple. And as you all probably already know, something as simple as
my $content = get("www.google.com");
doesn't quite work. Does anyone know a workaround to this using LWP::Simple where I can actually pull back search results?

Replies are listed 'Best First'.
Re: LWP and Google
by cLive ;-) (Prior) on Nov 08, 2004 at 07:22 UTC

    LWP::Simple's get() will suffice as a simple interface. But if you're using this non-commercially, you might want to consider the Google API - it's a lot more interesting to play with and quite powerful. If I remember correctly, you're limited to 1,000 searches a day. I wrote something with it a while back and loved it.

    .02

    cLive ;-)

      You can also use the API via the Net::Google module. I haven't used it myself, but it looks quite functional.
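
      For instance, here is a minimal sketch based on Net::Google's documented synopsis; the key is a placeholder you'd replace with your own from Google, and the accessor names are taken from the module's docs rather than from my own testing:

      use strict;
      use warnings;
      use Net::Google;

      # Placeholder: register with Google to obtain a real API key
      use constant LOCAL_GOOGLE_KEY => "your-google-api-key";

      my $google = Net::Google->new( key => LOCAL_GOOGLE_KEY );
      my $search = $google->search();
      $search->query(qw(langley public library));
      $search->max_results(10);

      # results() returns an array ref of result objects with
      # title() and URL() accessors
      print $_->title(), "\n\t", $_->URL(), "\n" for @{ $search->results() };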

Re: LWP and Google
by pg (Canon) on Nov 08, 2004 at 07:18 UTC

    I tried searching for "langley public library", and the URL was http://www.google.ca/search?hl=en&q=langley+public+library&meta= - that's the pattern you need.
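
    If you'd rather build that URL programmatically than paste it by hand, the URI module (which LWP already depends on) handles the escaping for you; a minimal sketch, mirroring the query parameters in the URL above:

    use strict;
    use warnings;
    use URI;

    my $u = URI->new('http://www.google.ca/search');
    # query_form() escapes the values and joins them with '&'
    $u->query_form( hl => 'en', q => 'langley public library', meta => '' );
    print "$u\n";   # http://www.google.ca/search?hl=en&q=langley+public+library&meta=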

    Update:

    sulfericacid is absolutely right. Thanks for pointing out my mistake! Google does not like LWP::UserAgent's default agent string either (see update 2 for more); it obviously checks for bots. This works:

    use strict;
    use warnings;
    use IO::Socket::INET;

    my $s = IO::Socket::INET->new(
        Proto    => "tcp",
        PeerAddr => "www.google.ca",
        PeerPort => 80,
    );
    my $url = "GET /search?hl=en&q=langley+public+library&meta= HTTP/1.1\r\n"
            . "Host: www.google.ca\r\n\r\n";
    print $s $url;
    while (my $l = <$s>) {
        print $l;
        last if ($l =~ /<\/html>/);
    }

    Update 2 ;-) Actually LWP::UserAgent also works with a little trick:

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new();
    $ua->agent("");
    my $url = "http://www.google.ca/search?hl=en&q=langley+public+library&meta=";
    print $ua->get($url)->content();
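
    If an empty agent string ever stops working, a common variant of the same trick is to send a browser-like string instead; the string below is just an example, not anything Google specifically requires:

    my $ua = LWP::UserAgent->new( agent => 'Mozilla/5.0' );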
      I've tried to do this a few times before as well (never actually found a solution). Even if you have the search pattern, you can't extract the contents of that page.

      I think what you need to do is set up your own bot/client in order for Google to allow you access.

      If you try the following code, you'll see you can't get back the contents.

      #!/usr/bin/perl

      use warnings;
      use strict;
      use LWP::Simple;

      my $source = get("http://www.google.ca/search?hl=en&q=langley+public+library&meta=");
      print $source;


      "Age is nothing more than an inaccurate number bestowed upon us at birth as just another means for others to judge and classify us"

      sulfericacid
Re: LWP and Google
by Cody Pendant (Prior) on Nov 08, 2004 at 10:20 UTC
    This works just fine:
    #!/usr/bin/perl

    use strict;
    use WWW::Mechanize;

    my $browser = WWW::Mechanize->new();
    $browser->get('http://www.google.com');
    $browser->form_name('f');
    $browser->field('q', 'langley public library');
    $browser->submit();
    print $browser->content();
    I've really come to rely on WWW::Mechanize lately. It's great.


    ($_='kkvvttuubbooppuuiiffssqqffssmmiibbddllffss')
    =~y~b-v~a-z~s; print
      I prefer to use the submit_form() method. Also, you need to check for errors, either manually or by setting autocheck.
      #!/usr/bin/perl

      use strict;
      use WWW::Mechanize;

      my $browser = WWW::Mechanize->new( autocheck => 1 );
      $browser->get('http://www.google.com');
      $browser->submit_form(
          form_name => 'f',
          fields    => { q => 'langley public library' },
      );
      print $browser->content();

      xoxo,
      Andy

Re: LWP and Google
by petdance (Parson) on Nov 08, 2004 at 17:23 UTC
    LWP::UserAgent doesn't parse content. WWW::Mechanize does.
    $ cat ./langley
    #!/usr/bin/perl

    use strict;
    use WWW::Mechanize;

    my $browser = WWW::Mechanize->new( autocheck => 1 );
    $browser->get('http://www.google.com');
    $browser->submit_form(
        form_name => 'f',
        fields    => { q => 'langley public library' },
    );
    my @links = $browser->links();
    for my $link ( @links ) {
        my $abs = $link->url_abs;
        next if $abs =~ m[^http://.+google\.com/];      # Google links
        next if $abs =~ m[^http://\Q64.233.167.104/];   # Cache
        print $link->text, "\n\t", $link->url, "\n";
    }
    $ ./langley
    Fraser Valley Regional Library
            http://www.fvrl.bc.ca/comm_branch_langleycity.asp
    Sno-Isle Libraries
            http://www.sno-isle.org/
    Public Visitor's Page for NASA Langley Technical Library
            http://library.larc.nasa.gov/Public/
    Notice for NASA Langley Employees visiting the Technical Library ...
            http://library.larc.nasa.gov/Public/nasalangley.htm
    Canadian library Web sites and catalogues by region: British ...
            http://www.collectionscanada.ca/gateway/s22-221-e.html
    LANGLEY PUBLIC LIBRARY in LANGLEY, Oklahoma Library Data / Profile
            http://www.librarybug.org/library-OK0111.html
    1st Services Squadron - Langley Air Force Base, Virginia
            http://www.langley.af.mil/1msg/1svs/Library.shtml
    Public Libraries, Oklahoma (Books)
            http://www.ohwy.com/ok/l/library.htm
    Library - Langley High School
            http://www.fcps.k12.va.us/LangleyHS/library/
    Kings Langley Public School Library
            http://members.ozemail.com.au/~stewil/fivew.html
    All the other caveats about Google not liking scraping still apply. Please also take a look at Spidering Hacks by Kevin Hemenway and Tara Calishain.

    xoxo,
    Andy

Re: LWP and Google
by ikegami (Patriarch) on Nov 08, 2004 at 07:23 UTC
    That's not a valid URI. You must specify the protocol (http://).
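
    With the scheme added, the call is at least a valid request (whether Google serves content to LWP's default agent string is a separate issue, as discussed above):

    use LWP::Simple;
    my $content = get("http://www.google.com/");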
Re: LWP and Google
by CountZero (Bishop) on Nov 08, 2004 at 17:18 UTC
    I remember this matter being discussed in a previous post, where someone pointed out that Google absolutely hates bots on its site and will even go as far as blocking the IP address of recalcitrant bots. If you have a dynamic IP address, this could really upset your provider.

    The preferred way to query them automatically is through their published API.

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law