in reply to Re^2: HTTP:Async weirdness
in thread HTTP:Async weirdness

I changed the code to use SQLite (and fixed a few problems along the way, such as the fact that I could serve you SQL if I knew you were crawling my site :)) and it works just fine here. What do you mean by "jumbled" contents anyway?

Here's what works for me:

#!/usr/bin/perl
use MIME::Base64;
use Encode qw(encode);
use DBI;
use DBI qw(:sql_types);
require HTTP::Request;
require HTTP::Response;
use HTTP::Async;
# HTTP::Async timeout is broken by default. Check the CPAN page for how
# to fix. It's in the bugs page.
my $async = HTTP::Async->new(timeout=>60,slots=>100); # I'm on a terribly slow line
use List::MoreUtils;
use strict;

open PIDFILE, ">$ENV{HOME}/pidfile" or die $!; # will run as user
print PIDFILE $$;
close PIDFILE;

# definition of variables
my $db="databasename";
my $host="localhost";
my $user="username";
my $password="password";
my $verbose_logging = -1;

while (1 == 1) {
    sleep rand 3;
    my @sites;
    my @db_row;
    my @response_array;
    my $dbh = DBI->connect ("DBI:SQLite:$db",'','')
        or die "Can't connect to database: $DBI::errstr\n";
    my $sth = $dbh->prepare( "SELECT `page_url`, `url2`, `idx`, `date_time`, `userid` FROM `queue` LIMIT 100");
    $sth->execute( );
    while ( my @row = $sth->fetchrow_array ) {
        push @sites, $row[0];
        push @db_row, [$row[0], $row[1], $row[2], $row[3], $row[4]];
    }
    foreach my $site(@sites) {
        $async->add( HTTP::Request->new( GET => $site ) );
    }
    while ( $async->not_empty ) {
        if ( my ($response, $id) = $async->wait_for_next_response ) {
            print $async->info;
            my $content = $response->decoded_content;
            my $result_row = $id - 1;
            my $urlcount = 0;
            $urlcount++ while ($content =~ m/$db_row[$result_row][1]/gi);
            $content = encode_base64(encode("UTF-8", $content));
            # Placeholders keep fetched data out of the SQL text
            my $QueueExecute = $dbh->prepare(
                "INSERT INTO cheker (`active`,`page_url`,`url2`,`date_time`,`userid`,`html_source`) VALUES (?,?,?,?,?,?);"
            );
            $QueueExecute->execute($urlcount,$db_row[$result_row][0],$db_row[$result_row][1],
                $db_row[$result_row][3],$db_row[$result_row][4],$content
            );
            # Report the error from the handle that actually failed
            warn "Problem in retrieving results", $QueueExecute->errstr( ), "\n" if $QueueExecute->err;
            print "Inserted record into checker\n" if ($verbose_logging >= 0);
            # Reuse the variable instead of redeclaring it with a second "my"
            $QueueExecute = $dbh->prepare("DELETE FROM queue WHERE `idx` = ?");
            $QueueExecute->execute($db_row[$result_row][2]);
            print "Deleted row from queue\n" if ($verbose_logging >= 0);
            warn "Problem in retrieving results", $QueueExecute->errstr, "\n" if $QueueExecute->err;
            warn "Problem in retrieving results", $sth->errstr, "\n" if $sth->err;
        }
        else {
            next;
        }
    }
}

SQLite stuff:

$ sqlite3 databasename
CREATE TABLE cheker (active tinyint, page_url varchar, url2 varchar, date_time datetime, userid integer, html_source text);
CREATE TABLE queue(page_url varchar, url2 varchar, idx integer, date_time datetime, userid integer);
insert into queue values ("http://google.com/","",1,date(),1);
insert into queue values ("http://gmx.com/","",1,date(),1);
insert into queue values ("http://twitter.com/","",1,date(),1);

To test:

sqlite3 -list databasename "select html_source from cheker"|base64 -d|less

Re^4: HTTP:Async weirdness
by schnibitz (Novice) on Jan 22, 2012 at 14:54 UTC
    Good question, and HUGE thanks for putting that code through your testing. Suppose it processes 50 or so requests: I'll end up with about 65-70 rows written to the DB when all is said and done. Many will have a blank html_source, and many will have a blank page_url. Some entries will be doubled . . . In fact, this happens even with lower numbers of requests. The problem seems to be exacerbated by sites that time out or are slow. The code seems to want to grab the results of the request in the middle of the request. I only want enough blocking going on so that my $content = $response->decoded_content; isn't picking the results off prematurely. There has to be a way to do that. -S
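    One possible guard against blank html_source rows is to look at the response before storing it. This is only a sketch: `usable_content` is a hypothetical helper (not part of HTTP::Async), and it assumes that slow or timed-out requests come back as ordinary non-success HTTP::Response objects, which this thread does not confirm. The two responses are built by hand so the snippet runs without network access.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: return the decoded body only for a successful,
# non-empty response; return undef otherwise so the caller can skip
# the INSERT for that request.
sub usable_content {
    my ($response) = @_;
    return undef unless $response && $response->is_success;
    my $content = $response->decoded_content;
    return undef unless defined $content && length $content;
    return $content;
}

# Hand-built responses standing in for what the async loop would return.
my $ok = HTTP::Response->new(
    200, 'OK', [ 'Content-Type' => 'text/html' ], '<html>hi</html>'
);
my $failed = HTTP::Response->new( 504, 'Gateway Timeout' );

for my $response ( $ok, $failed ) {
    my $content = usable_content($response);
    print $response->code, ': ',
        ( defined $content ? 'store it' : 'skip it' ), "\n";
}
```

    In the crawler loop, a check like this would sit right after wait_for_next_response, before the match count and the INSERT.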
Re^4: HTTP:Async weirdness
by schnibitz (Novice) on Jan 22, 2012 at 15:59 UTC
    Okay I think I figured part of it out (THINK). We'll see, check this out:
    my $result_row = $id - 1;
    My assumption that the row numbers were out of sync was incorrect. Apparently both the result rows and the HTTP response ID start with row "1". So I was fetching stuff that was out of sync. Now the records match up. The strange thing, though, is that even though I submitted only 100 requests, it returned 233 with html_content that wasn't associated with an ID or anything.
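    One way to sidestep the index arithmetic altogether is to key the queue rows by the request ID itself. The sketch below is only an illustration: `fake_add` is a hypothetical stand-in for `$async->add(...)`, the URLs are made up, and it assumes IDs start at 1 and keep counting up for as long as the same object is reused. That assumption would be consistent with the mismatches described in this thread, but it is not confirmed here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $async->add(...): it only hands out IDs.
# Assumption: IDs start at 1 and never reset while the object lives.
my $next_id = 0;
sub fake_add { return ++$next_id }

# Two batches, as in two passes of the outer while loop.
my @batches = (
    [ [ 'http://example.com/a', 'foo', 1 ], [ 'http://example.com/b', 'bar', 2 ] ],
    [ [ 'http://example.com/c', 'baz', 3 ] ],
);

my @lookups;
for my $batch (@batches) {
    my @db_row = @$batch;
    my %row_for_id;    # request ID => queue row (array ref)
    $row_for_id{ fake_add() } = $_ for @db_row;

    for my $id ( sort { $a <=> $b } keys %row_for_id ) {
        my $by_hash  = $row_for_id{$id}[0];    # lookup keyed by ID
        my $by_index = $db_row[ $id - 1 ];     # lookup by "$id - 1"
        push @lookups,
            [ $id, $by_hash, defined $by_index ? $by_index->[0] : undef ];
    }
}

printf "id %d: hash=%s index=%s\n", $_->[0], $_->[1], $_->[2] // '(none)'
    for @lookups;
```

    Under that assumption, the hash lookup stays correct in the second batch, while the index lookup falls off the end of the array. In the real script the ID would come from the return value of `$async->add(...)`.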
Re^4: HTTP:Async weirdness
by schnibitz (Novice) on Jan 22, 2012 at 22:34 UTC
    Well, I'm happy to report I finally got it working as advertised. I put this: my $async = HTTP::Async->new(timeout=>5,slots=>100); within the main loop so it gets redefined each time, and I put: if (@sites == '') {redo;} before the async code. That way, when it checks the DB and the DB is empty, it won't try to run any of the async or DBI code where it gets into trouble. I shouldn't really have to do that, but as it turns out, that's how I got it to behave. The whole thing was UGLY, but it worked. -S
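    The empty-queue check can also be written as `next unless @sites;`, which tests the array directly instead of comparing its element count to an empty string (a comparison that warns under `use warnings`). The following is a minimal sketch of that loop shape only: `fetch_queue` and the canned batches are hypothetical stand-ins for the SELECT against the queue table, and the loop runs three passes rather than forever so it terminates.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the queue SELECT: one canned batch per call,
# with an empty batch in the middle to simulate an empty queue.
my @pending = (
    [ 'http://example.com/a', 'http://example.com/b' ],
    [],
    [ 'http://example.com/c' ],
);

sub fetch_queue {
    my $batch = shift @pending;
    return $batch ? @$batch : ();
}

my @processed;
for my $pass ( 1 .. 3 ) {
    my @sites = fetch_queue();

    # Nothing queued: skip the async and DBI work for this pass.
    next unless @sites;

    # The real script would create a fresh HTTP::Async object here,
    # add the requests, and collect the responses.
    push @processed, { pass => $pass, count => scalar @sites };
}

printf "pass %d handled %d site(s)\n", $_->{pass}, $_->{count}
    for @processed;
```

    Creating the async object after the check keeps it scoped to a single batch, which matches the fix described above.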