Ok this is my (i'm sure highly inefficient and embarrisingly limited) perl code for a web robot;) I have missed lots and lots out to try not to bore people with the details. It works fine for ages, but just crashes out when looking up certain urls... weird.

#!/usr/bin/perl -w ######################################################## # Program written by Tom Freeman 2003 # ----------------------------------- # Purpose: To collect information about the technologies # being used on websites, particularly client side # technologies. As well as the server the sites are using. # # Improvements # v1.8: Added support for the META exclusion protocol # i.e. <META name="ROBOTS" content="NOINDEX, NOFOLLOW"> # # v3.3: now retains a list of new domains on disk in # domains.dat file # ######################################################## require LWP::Parallel::UserAgent; use HTTP::Request; #use LWP::Debug qw(+); # display tons of debugging messages. See 'perl +doc LWP::Debug' use URI; use URI::URL; use HTML::LinkExtor; use LWP::Simple qw(get); use Time::SoFar qw( runtime runinterval figuretimes ); use Fcntl; use DBI; my $domainsfile = 'domains.dat'; ####### VARIABLES ###################################### my $pua = LWP::Parallel::UserAgent->new(); $pua->in_order (1); # handle requests in order of registration $pua->duplicates(0); # ignore duplicates $pua->timeout (2); # in seconds $pua->redirect (1); # follow redirects my @domains=(); push @domains, 'http://www.bbc.co.uk'; #load first value onto the doma +ins array my $pass_counter = '0'; $num_domains = scalar @domains; while ($num_domains > '0') { # Start Loopy Code ######## Load up the old domains file from disk ###### open DOMAINSFILE, "< $domainsfile" || print "Failed to open logfile +for writing: $!"; @domains = (<DOMAINSFILE>); close(DOMAINSFILE); ###################################################### $pass_counter++; my @selected_domains=(); while (scalar @selected_domains <= 10) { my $new_domain = shift @domains; my $visited = &checkExists($new_domain); if (!$visited) { push @selected_domains, $new_domain; } } my $reqs = [ HTTP::Request->new('GET', $selected_domains[0]), HTTP::Request->new('GET', $selected_domains[1]), HTTP::Request->new('GET', $selected_domains[2]), HTTP::Request->new('GET', $selected_domains[3]), HTTP::Request->new('GET', $selected_domains[4]), HTTP::Request->new('GET', $selected_domains[5]), HTTP::Request->new('GET', $selected_domains[6]), HTTP::Request->new('GET', $selected_domains[7]), HTTP::Request->new('GET', $selected_domains[8]), HTTP::Request->new('GET', $selected_domains[9]), ]; foreach my $req (@$reqs) { if ($req->url) { print "Registering ".$req->url."\n"; if ( my $res = $pua->register ($req) ) { print STDERR $res->error_as_HTML; } } } $entries = $pua->wait(); foreach (keys %$entries) { my $res = $entries->{$_}->response; my $domain = $res->request->url; if ($res->is_success) { # Do lots of interesting stuff with the # content and write out to the database } } $num_domains = scalar @domains; # get the size of array ######### Write out the new domains file ###### open DOMAINSFILE, "> $domainsfile" || print "Failed to open logfile +for writing: $!"; foreach my $domain (@domains) { print DOMAINSFILE "$domain\n"; } close(DOMAINSFILE); ############################################### (@domains)=(); undef %$entries; }

I hope this isn't too much code to annoy people, i'd be interested to hear any of your thoughts, ideas and comments?

Many Thanks,
Tom


In reply to Re: Re: Problem with LWP::Parallel::UserAgent by CodeJunkie
in thread Problem with LWP::Parallel::UserAgent by CodeJunkie

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.