in reply to Re: HTTP:Async weirdness
in thread HTTP:Async weirdness

Okay, I tried to sanitize the code as much as possible. Was initially reluctant to do so but I hafta help y'all help me, and this is the best way. And yes, I know I'm a terrible coder:
#!/usr/bin/perl
use strict;
use MIME::Base64;
use Encode qw(encode);
use DBI;
use DBI qw(:sql_types);
require HTTP::Request;
require HTTP::Response;
use HTTP::Async;
use List::MoreUtils;

# HTTP::Async timeout is broken by default. Check the CPAN page for how
# to fix it. It's in the bugs page.
my $async = HTTP::Async->new( timeout => 5, slots => 100 );

open PIDFILE, ">/usr/path/pidfile" or die $!;
print PIDFILE $$;
close PIDFILE;

# definition of variables
my $db       = "databasename";
my $host     = "localhost";
my $user     = "username";
my $password = "password";
my $verbose_logging = -1;

while (1) {
    sleep rand 3;
    my @sites;
    my @db_row;
    my @response_array;

    # connect to the MySQL database
    my $dbh = DBI->connect( "DBI:mysql:database=$db:host=$host", $user, $password )
        or die "Can't connect to database: $DBI::errstr\n";

    # prepare and execute the query
    my $sth = $dbh->prepare(
        "SELECT `page_url`, `url2`, `idx`, `date_time`, `userid` FROM `queue` LIMIT 100");
    $sth->execute();

    # retrieve the results a row at a time
    while ( my @row = $sth->fetchrow_array() ) {
        push @sites, $row[0];
        push @db_row, [ $row[0], $row[1], $row[2], $row[3], $row[4] ];
    }

    foreach my $site (@sites) {
        $async->add( HTTP::Request->new( GET => $site ) );
    }

    # Do some processing with $response
    #while ( my ( $response, $id ) = $async->wait_for_next_response ) {
    while ( $async->not_empty ) {
        if ( my ( $response, $id ) = $async->wait_for_next_response ) {
            # deal with $response
            print $async->info;
            my $content = $response->decoded_content;    # needs to be re-encoded

            # async assigns IDs for all the requests, but they begin with 1,
            # whereas rows begin with 0 with SQL requests. We'll need to
            # decrement them to make sure everything matches up when we pick
            # the elements out of the nested array.
            my $result_row = $id - 1;

            # count the number of occurrences of the input in the HTML content
            my $urlcount = 0;
            $urlcount++ while ( $content =~ m/$db_row[$result_row][1]/gi );

            # encode the content so that we can easily store it in the DB
            # for later referencing
            $content = encode_base64( encode( "UTF-8", $content ) );

            my $insert = "INSERT INTO cheker (`active`,`page_url`,`url2`,`date_time`,`userid`,`html_source`) VALUES ($urlcount,'$db_row[$result_row][0]','$db_row[$result_row][1]','$db_row[$result_row][3]','$db_row[$result_row][4]','$content');";
            my $QueueExecute = $dbh->prepare($insert);
            $QueueExecute->execute();
            warn "Problem in retrieving results", $sth->errstr(), "\n"
                if $QueueExecute->err();
            print "Inserted record into checker\n" if $verbose_logging >= 0;

            my $delete = "DELETE FROM queue WHERE `idx` = '$db_row[$result_row][2]';";
            $QueueExecute = $dbh->prepare($delete);
            $QueueExecute->execute();
            print "Deleted row from queue\n" if $verbose_logging >= 0;
            warn "Problem in retrieving results", $sth->errstr(), "\n"
                if $QueueExecute->err();
            warn "Problem in retrieving results", $sth->errstr(), "\n"
                if $sth->err();
        }
        else {
            next;
        }
    }
    #$result_row++;
    # my $result_row = 0;
    # my $response_row = 1;
}
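A side note on the counting loop above: `$db_row[$result_row][1]` is interpolated into the match pattern unescaped, so a URL containing regex metacharacters like `?` or `.` can miscount or even make the match die. A minimal sketch of the safer `\Q...\E` (quotemeta) form; the sample strings here are made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count literal occurrences of $needle in $haystack, case-insensitively.
# \Q...\E escapes regex metacharacters, so a URL such as
# "http://example.com/?p=1" is matched as a literal string rather than
# as a pattern where "." and "?" have special meaning.
sub count_occurrences {
    my ( $haystack, $needle ) = @_;
    my $count = 0;
    $count++ while $haystack =~ m/\Q$needle\E/gi;
    return $count;
}

my $html = 'see HTTP://example.com/?p=1 and http://example.com/?p=1 again';
print count_occurrences( $html, 'http://example.com/?p=1' ), "\n";   # prints 2
```

Without `\Q...\E`, the `?` after the `/` would mean "optional `/`" and the pattern could match things it shouldn't.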

Replies are listed 'Best First'.
Re^3: HTTP:Async weirdness
by mbethke (Hermit) on Jan 22, 2012 at 05:28 UTC

    I changed the code to use SQLite (and fixed a few problems on the way, such as the fact that I could serve you SQL injection if I knew you were crawling my site :)) and it works just fine here. What do you mean by "jumbled" contents anyway?

    Here's what works for me:

    #!/usr/bin/perl
    use strict;
    use MIME::Base64;
    use Encode qw(encode);
    use DBI;
    use DBI qw(:sql_types);
    require HTTP::Request;
    require HTTP::Response;
    use HTTP::Async;
    use List::MoreUtils;

    # HTTP::Async timeout is broken by default. Check the CPAN page for how
    # to fix it. It's in the bugs page.
    my $async = HTTP::Async->new( timeout => 60, slots => 100 );  # I'm on a terribly slow line

    open PIDFILE, ">$ENV{HOME}/pidfile" or die $!;    # will run as user
    print PIDFILE $$;
    close PIDFILE;

    # definition of variables
    my $db       = "databasename";
    my $host     = "localhost";
    my $user     = "username";
    my $password = "password";
    my $verbose_logging = -1;

    while (1) {
        sleep rand 3;
        my @sites;
        my @db_row;
        my @response_array;

        my $dbh = DBI->connect( "DBI:SQLite:$db", '', '' )
            or die "Can't connect to database: $DBI::errstr\n";
        my $sth = $dbh->prepare(
            "SELECT `page_url`, `url2`, `idx`, `date_time`, `userid` FROM `queue` LIMIT 100");
        $sth->execute();

        while ( my @row = $sth->fetchrow_array ) {
            push @sites, $row[0];
            push @db_row, [ $row[0], $row[1], $row[2], $row[3], $row[4] ];
        }

        foreach my $site (@sites) {
            $async->add( HTTP::Request->new( GET => $site ) );
        }

        while ( $async->not_empty ) {
            if ( my ( $response, $id ) = $async->wait_for_next_response ) {
                print $async->info;
                my $content    = $response->decoded_content;
                my $result_row = $id - 1;
                my $urlcount   = 0;
                $urlcount++ while ( $content =~ m/$db_row[$result_row][1]/gi );
                $content = encode_base64( encode( "UTF-8", $content ) );

                my $QueueExecute = $dbh->prepare(
                    "INSERT INTO cheker (`active`,`page_url`,`url2`,`date_time`,`userid`,`html_source`)
                     VALUES (?,?,?,?,?,?);"
                );
                $QueueExecute->execute(
                    $urlcount, $db_row[$result_row][0], $db_row[$result_row][1],
                    $db_row[$result_row][3], $db_row[$result_row][4], $content
                );
                warn "Problem in retrieving results", $sth->errstr(), "\n"
                    if $QueueExecute->err;
                print "Inserted record into checker\n" if $verbose_logging >= 0;

                $QueueExecute = $dbh->prepare("DELETE FROM queue WHERE `idx` = ?");
                $QueueExecute->execute( $db_row[$result_row][2] );
                print "Deleted row from queue\n" if $verbose_logging >= 0;
                warn "Problem in retrieving results", $sth->errstr, "\n"
                    if $QueueExecute->err;
                warn "Problem in retrieving results", $sth->errstr, "\n"
                    if $sth->err;
            }
            else {
                next;
            }
        }
    }

    SQLite stuff:

    $ sqlite3 databasename
    CREATE TABLE cheker (active tinyint, page_url varchar, url2 varchar,
        date_time datetime, userid integer, html_source text);
    CREATE TABLE queue (page_url varchar, url2 varchar, idx integer,
        date_time datetime, userid integer);
    insert into queue values ("http://google.com/","",1,date(),1);
    insert into queue values ("http://gmx.com/","",1,date(),1);
    insert into queue values ("http://twitter.com/","",1,date(),1);

    To test:

    sqlite3 -list databasename "select html_source from cheker"|base64 -d|less
      Good question, and HUGE thanks for putting that code through your testing. Suppose it processes about 50 requests: I'll end up with about 65-70 rows written to the DB when all is said and done. Many will have a blank html_source, and many a blank page_url. Some entries will be doubled. In fact this happens even with lower numbers of requests. The problem seems to be exacerbated by sites that time out or are slow. The code seems to want to grab the results of the request in the middle of the request. I only want enough blocking going on so that my $content = $response->decoded_content; isn't picking the results off prematurely. There has to be a way to do that. -S
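Those blank html_source rows are consistent with timed-out or failed requests: as I understand it, HTTP::Async hands back an error response (e.g. a 504) rather than nothing, and its content can be empty. Gating the DB insert on is_success() would keep the two cases apart. A small sketch using hand-built HTTP::Response objects, so no network is needed (the 504 case here is an assumed stand-in for what a timed-out slot returns):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Response;

# Return the page body only for successful responses; an unsuccessful
# (4xx/5xx) response yields undef so the caller can skip the DB insert
# instead of storing a blank html_source.
sub usable_content {
    my ($response) = @_;
    return unless $response->is_success;
    return $response->decoded_content;
}

my $ok  = HTTP::Response->new( 200, 'OK',              [], '<html>real page</html>' );
my $bad = HTTP::Response->new( 504, 'Gateway Timeout', [], '' );

print defined usable_content($ok)  ? "store\n" : "skip\n";   # prints "store"
print defined usable_content($bad) ? "store\n" : "skip\n";   # prints "skip"
```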
      Okay I think I figured part of it out (THINK). We'll see, check this out:
      my $result_row = $id - 1;
      My assumption that the row numbers were out of sync was incorrect. Apparently both the result rows and the HTTP response IDs start at 1, so I was fetching stuff that was out of sync. Now the records match up. The strange thing, though, is that even though I just submitted 100 requests, it returned 233 rows with html_content that wasn't associated with an ID or anything.
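Rather than guessing whether the IDs start at 0 or 1, another option is to key the row data by the ID that add() returns: HTTP::Async's add() hands back the ID it assigned, and wait_for_next_response() later returns that same ID. The sketch below uses a tiny mock in place of HTTP::Async so it runs without the network; the ID-keying pattern is the point, and the starting ID of 7 is deliberately arbitrary:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# MockAsync stands in for HTTP::Async: add() returns the assigned id,
# wait_for_next_response() returns ($response, $id) pairs, just like the
# real module. IDs deliberately do NOT start at 0 or 1 here, to show the
# mapping never relies on their numbering.
package MockAsync;
sub new { bless { next_id => 7, pending => [] }, shift }
sub add {
    my ( $self, $req ) = @_;
    my $id = $self->{next_id}++;
    push @{ $self->{pending} }, [ "response for $req", $id ];
    return $id;
}
sub wait_for_next_response {
    my ($self) = @_;
    return @{ shift( @{ $self->{pending} } ) // [] };
}

package main;

my @db_row = ( [ 'http://a.example/', 'x' ], [ 'http://b.example/', 'y' ] );
my $async  = MockAsync->new;

my %row_for;    # maps request id -> queue row
for my $row (@db_row) {
    my $id = $async->add( $row->[0] );
    $row_for{$id} = $row;              # key by the returned id: no off-by-one possible
}

while ( my ( $response, $id ) = $async->wait_for_next_response ) {
    print "$row_for{$id}[0] => $response\n";
}
```

With the real HTTP::Async the only change is constructing an HTTP::Request in the add() call; the `%row_for` lookup stays identical.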
      Well, I'm happy to report I finally got it working as advertised. I put this: my $async = HTTP::Async->new(timeout=>5,slots=>100); within the main loop so it gets redefined each time, and I put: if (@sites == '') {redo;} before the async code. That way when it checks the DB, if the queue is empty, it won't try to run any of the async or DBI code where it gets into trouble. I shouldn't really have to do that, but as it turns out, that's how I got it to behave. The whole thing was UGLY, but it worked. -S
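A side note on that guard: `@sites == ''` only works by accident, because the empty string numifies to 0, so the test is really `scalar(@sites) == 0` (and it warns under `use warnings`). The idiomatic spelling tests the array in boolean context, as in this small sketch:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# An array in boolean context is false exactly when it is empty, so
# "unless (@sites)" says the same thing as the accidental "@sites == ''"
# without the string-to-number conversion (or the warning it triggers).
sub queue_is_empty {
    my @sites = @_;
    return !@sites;          # true when there is nothing to fetch
}

print queue_is_empty() ? "empty\n" : "busy\n";                       # prints "empty"
print queue_is_empty('http://a.example/') ? "empty\n" : "busy\n";    # prints "busy"
```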