I'd point you to WWW::Mechanize, which is very good for this kind of thing.

Incidentally, note that Google's terms of service prohibit automated queries, and they do check, so don't run too many queries this way (thanks to bunnyman for providing the link to the terms of service document).
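To make "don't do too many queries" concrete, here is a minimal sketch of a randomized pause between requests, using core Perl only. The helper name polite_delay and the one-second floor are my own assumptions for illustration, not something Google or any module prescribes.

```perl
use strict;
use warnings;

# Hypothetical helper: return a pause of roughly $mean seconds,
# jittered by about +/- $var/2, and never shorter than one second.
sub polite_delay {
    my ($mean, $var) = @_;
    my $delay = $mean + int($var / 2 - rand($var));
    return $delay < 1 ? 1 : $delay;
}

# Typical use inside a request loop (the request itself is omitted here):
#   sleep polite_delay(5, 3);
```

With a mean of 5 and a spread of 3, this pauses between 4 and 6 seconds per request. Whether that is polite enough depends on the site, so treat the numbers as placeholders.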

Hope this helps, -gjb-

Update: I realized I could give you a complete example to get you started; see below.

#!/usr/bin/perl

use strict;
use warnings;
use diagnostics;

use Set::Scalar;
use WWW::Mechanize;
use URI;

# URL for DBLP search
my $dblpURL = 'http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/';

# Delay time
my ($delayMean, $delayVar) = (5, 3);

# print headers of output
print join("\t", ('name', 'DBLP entry', 'home page', 'email')), "\n";

# iterate over list of author's names
my $counter = 0;
while (<>) {
    # since one should be polite (or stealthy), sleep for a while
    sleep($delayMean + int($delayVar/2 - rand($delayVar)));
    $counter++;
    my @data;
    s/^[ ]*(.+?)[ \r\n]*$/$1/;
    my ($name, $dblpEntry, $homePage, $email) = split(/\t/, $_);
    print STDERR "now handling record $counter: '$name'\n";
    if (!defined $homePage || $homePage eq '') {
        push(@data, $name);
        my $mech = WWW::Mechanize->new();
        $mech->agent_alias('Windows IE 6');

        # get DBLP search page
        $mech->get($dblpURL);

        # insert author's name in search page and submit
        $mech->form_number(1);
        $mech->field('author', $name);
        $mech->submit();

        # if author has a DBLP entry, the resulting page will
        # have his name as title, if not, stop processing
        # author
        if ($mech->title() !~ /$name/) {
            print join("\t", (@data, 0, "", "")), "\n";
            next;
        }
        push(@data, 1);

        # search for a link that has 'Home Page' as text and
        # follow it, stop processing if there is none
        if (!defined $mech->follow_link(text => "Home Page", n => 1) &&
            !defined $mech->follow_link(text => "Home page", n => 1) &&
            !defined $mech->follow_link(text => "home Page", n => 1) &&
            !defined $mech->follow_link(text => "home page", n => 1) &&
            !defined $mech->follow_link(text => "Homepage", n => 1) &&
            !defined $mech->follow_link(text => "HomePage", n => 1) &&
            !defined $mech->follow_link(text => "homePage", n => 1) &&
            !defined $mech->follow_link(text => "homepage", n => 1)) {
            print join("\t", (@data, "", "")), "\n";
            next;
        }
        push(@data, $mech->uri());

        # retrieve all links on the author's home page and
        # output only those that are 'mailto' URLs
        my @links = map { URI->new($_->[0]) } $mech->links();
        my $addresses = Set::Scalar->new();
        foreach my $link (@links) {
            if (defined $link->scheme() &&
                ($link->scheme() eq 'mailto') &&
                !$addresses->contains($link->opaque())) {
                print join("\t", (@data, $link->opaque())), "\n";
                $addresses->insert($link->opaque());
            }
        }
        print join("\t", (@data, "")), "\n" if $addresses->size() == 0;
    }
    else {
        print join("\t", ($name, $dblpEntry, $homePage, $email)), "\n";
    }
}
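Two spots in the script could be tightened, sketched here in core Perl only. The title check interpolates the author's name straight into a regex, so a name containing regex metacharacters (parentheses, `+`, and so on) may match wrongly or fail to compile. Wrapping it in `\Q...\E` treats it as literal text. The eight exact-text link lookups could likewise collapse into one case-insensitive pattern. The helper names below are hypothetical, and the pattern is a little more permissive than the original list, since it also tolerates surrounding whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the page title contains the author's name,
# with the name treated as literal text rather than as a regex.
sub title_matches_name {
    my ($title, $name) = @_;
    return defined $title && $title =~ /\Q$name\E/ ? 1 : 0;
}

# One case-insensitive pattern covering "Home Page", "homepage", etc.
my $home_page_re = qr/^\s*home\s?page\s*$/i;

# Hypothetical helper: true if a link's text looks like a home page link.
sub is_home_page_link {
    my ($text) = @_;
    return defined $text && $text =~ $home_page_re ? 1 : 0;
}
```

As far as I know, WWW::Mechanize's follow_link and find_link accept a text_regex parameter, so a pattern like this could replace the chain of exact-text attempts. Check the documentation of your installed version first, and note that newer versions may die rather than return undef when no link is found unless autocheck is turned off.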

In reply to Re: My First HTML/Web based script. by gjb
in thread My First HTML/Web based script. by Anonymous Monk
