PerlMonks  

Re: Google Earth Monks

by petdance (Parson)
on Jul 03, 2006 at 06:56 UTC ( [id://558944]=note )


in reply to Google Earth Monks

Haven't seen your code, but if you're extracting links from the monk page, take a look at WWW::Mechanize and let it do the link extraction for you. It should make your code simpler.
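A minimal sketch of what that could look like, not taken from the code under discussion. It assumes the links of interest carry a `node_id=` parameter in their URL (the form used elsewhere in this thread); the helper name `node_links` and the variable `$listing_url` are hypothetical.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Hypothetical helper: given a WWW::Mechanize object that has already
# fetched a page, return [node_id, link text] pairs for every link whose
# URL contains a node_id parameter (an assumption about the page layout).
sub node_links {
    my ($mech) = @_;
    my @pairs;
    for my $link ( $mech->find_all_links( url_regex => qr/node_id=\d+/ ) ) {
        my ($id) = $link->url =~ /node_id=(\d+)/;
        my $text = $link->text;
        $text = '' unless defined $text;    # image links may have no text
        push @pairs, [ $id, $text ];
    }
    return @pairs;
}

# Usage (needs network access; $listing_url is whatever page lists the monks):
#   my $mech = WWW::Mechanize->new( autocheck => 1 );
#   $mech->get($listing_url);
#   printf "%s => %s\n", @$_ for node_links($mech);
```

Mechanize does the HTML parsing and link bookkeeping, so the calling code only has to say which links it wants.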

xoxo,
Andy

Re^2: Google Earth Monks
by McDarren (Abbot) on Jul 03, 2006 at 08:19 UTC
    Thanks :)

    Actually, I've never used WWW::Mechanize, so it didn't occur to me to try that. The routine I use for scraping the data from the Monk homenodes is given below. I think the main performance hit is the fact that I need to issue a separate request for each Monk. Ideally, it would be good to be able to grab all this information in a single go. But I'm not aware of any way that this is currently possible.

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTTP::Request;
    use HTML::TokeParser;

    sub get_monk_stats {
        my $ref      = shift;
        my $monk_url = 'http://www.perlmonks.org/?node_id=';

        # Labels of the homenode table cells whose values we keep.
        my %monk_fields = (
            'User since:' => 1,
            'Last here:'  => 1,
            'Experience:' => 1,
            'Level:'      => 1,
            'Writeups:'   => 1,
        );

        # One user agent can be reused for every request.
        my $ua = LWP::UserAgent->new();

        MONK:
        foreach my $id (keys %{$ref}) {
            print "Getting data for $ref->{$id}{name} ($id)\n";
            my $req    = HTTP::Request->new(GET => "$monk_url$id");
            my $result = $ua->request($req);
            next MONK if !$result->is_success;

            my $content = $result->content;
            my $p       = HTML::TokeParser->new(\$content);

            # A label cell is followed by the cell holding its value.
            while (my $tag = $p->get_tag("td")) {
                my $text = $p->get_trimmed_text("/td");
                if ($monk_fields{$text}) {
                    $p->get_tag("td");
                    $ref->{$id}{$text} = $p->get_trimmed_text("/td");
                }
            }
        }
        return $ref;
    }
      Ideally, it would be good to be able to grab all this information in a single go. But I'm not aware of any way that this is currently possible.

      You can work in parallel using POE::Component::Client::HTTP. Check it out.
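      A rough sketch of the parallel approach, assuming POE and POE::Component::Client::HTTP are installed. The helper name `fetch_monk_pages`, the alias `ua`, the event name `got_response`, and the 30-second timeout are illustrative choices, not something from the posts above. It still issues one request per monk, but the requests overlap instead of running one after another.

```perl
use strict;
use warnings;
use POE qw(Component::Client::HTTP);
use HTTP::Request;

my $monk_url = 'http://www.perlmonks.org/?node_id=';

# Hypothetical helper: fetch all given node ids concurrently and return
# a hash reference of { id => page content }. Intended to be called once
# per program, since it spawns a component under a fixed alias.
sub fetch_monk_pages {
    my @ids = @_;
    return {} unless @ids;

    my %page;
    my $pending = @ids;

    # One HTTP client component, addressed through the alias 'ua'.
    POE::Component::Client::HTTP->spawn( Alias => 'ua', Timeout => 30 );

    POE::Session->create(
        inline_states => {
            _start => sub {
                # Queue every request up front; the id travels along as a tag.
                for my $id (@ids) {
                    $_[KERNEL]->post( 'ua', 'request', 'got_response',
                        HTTP::Request->new( GET => "$monk_url$id" ), $id );
                }
            },
            got_response => sub {
                my ( $request_packet, $response_packet ) = @_[ ARG0, ARG1 ];
                my $id       = $request_packet->[1];     # the tag posted above
                my $response = $response_packet->[0];    # an HTTP::Response
                $page{$id} = $response->content if $response->is_success;

                # Stop the client once the last answer is in.
                $_[KERNEL]->post( 'ua', 'shutdown' ) if --$pending == 0;
            },
        },
    );

    POE::Kernel->run();    # blocks until the sessions above have finished
    return \%page;
}
```

      Each returned page can then go through the same HTML::TokeParser loop as before; only the fetching changes.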

      --
      David Serrano
