schnibitz has asked for the wisdom of the Perl Monks concerning the following question:

Hey everyone, so check this out:
use HTTP::Async;

my $async = HTTP::Async->new;

# create some requests and add them to the queue.
$async->add( HTTP::Request->new( GET => 'http://www.perl.org/' ) );
$async->add( HTTP::Request->new( GET => 'http://www.ecclestoad.co.uk/' ) );

while ( my $response = $async->wait_for_next_response ) {
    # Do some processing with $response
}
When I put a nested array in that last while loop, assign the response/ID to that array, and grab the data out of that nested array in the rest of the code, outside the loop (DBI code, BTW), it works. It's slow, but it works. When I put that same DBI code inside the loop, the response data gets all jumbled up. Is there anything in how HTTP::Async works that would make what I'm trying to do impossible? Or am I just messing something up? I don't get why a simple nested array works, but when I put my code in there -- the same code that originally referenced data in that array -- with the appropriate modifications, the responses get all jumbled up. Any ideas?

-S

I'm thinking my DBI code, which inserts rows using data stored in strings, is getting passed over more than once simultaneously, so string values are getting poisoned by other string values. I just don't really know how to prove that, or fix it. Any ideas?

-S
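In case it helps, here's a stripped-down sketch of the approach I'm describing (the names are made up for illustration, not my real code): stash the request data keyed by the id that add() returns, collect the responses inside the loop, and only do the per-row work afterwards.

use strict;
use warnings;
use HTTP::Async;
use HTTP::Request;

my $async = HTTP::Async->new;

# Remember which URL belongs to which request, keyed by the id that
# add() hands back (the ids start at 1).
my %url_for;
for my $url ( 'http://www.perl.org/', 'http://www.ecclestoad.co.uk/' ) {
    my $id = $async->add( HTTP::Request->new( GET => $url ) );
    $url_for{$id} = $url;
}

# Collect the responses first...
my %content_for;
while ( my ( $response, $id ) = $async->wait_for_next_response ) {
    $content_for{$id} = $response->decoded_content;
}

# ...and only then, outside the loop, do the slow per-row work
# (this is where the DBI inserts live in the version that works).
for my $id ( sort { $a <=> $b } keys %content_for ) {
    print "$url_for{$id}: ", length( $content_for{$id} ), " bytes\n";
}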

Replies are listed 'Best First'.
Re: HTTP:Async weirdness
by GrandFather (Saint) on Jan 22, 2012 at 02:51 UTC

    How about you show us the actual code that is failing rather than have us guess at it? Even better would be to reproduce the problem you are getting by mocking up some code to look like HTTP::Async and HTTP::Request, maybe as simple as:

    use strict;
    use warnings;

    package MockAsync;

    sub new {
        my ($class, %params) = @_;
        return bless \%params, $class;
    }

    sub add {
        my ($self, $request) = @_;
        push @{$self->{sources}}, $request;
    }

    sub wait_for_next_response {
        my ($self) = @_;
        return if ! @{$self->{sources}};
        return splice @{$self->{sources}}, rand @{$self->{sources}}, 1;
    }

    package MockRequest;

    sub new {
        my ($class, %params) = @_;
        return bless \%params, $class;
    }

    sub uri {
        my ($self) = @_;
        return $self->{GET};
    }

    package main;

    my $async = MockAsync->new;

    $async->add(MockRequest->new(GET => 'http://www.perl.org/'));
    $async->add(MockRequest->new(GET => 'http://www.ecclestoad.co.uk/'));

    while (my $response = $async->wait_for_next_response) {
        printf "Response from %s\n", $response->uri();
    }

    The point being that you can isolate the HTTP related code from the database code and thus isolate the problem code.
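    Something along these lines (the $USE_MOCK flag is purely illustrative) lets the same processing loop run against either the mock classes above or the real modules, so you can tell whether the database code misbehaves on its own:

    my $USE_MOCK = 1;    # flip to 0 to use the real modules
    if (!$USE_MOCK) {
        require HTTP::Async;
        require HTTP::Request;
    }

    my $async = $USE_MOCK ? MockAsync->new : HTTP::Async->new;

    for my $url ('http://www.perl.org/', 'http://www.ecclestoad.co.uk/') {
        $async->add(
            $USE_MOCK
            ? MockRequest->new(GET => $url)
            : HTTP::Request->new(GET => $url)
        );
    }

    while (my $response = $async->wait_for_next_response) {
        # put the database code here: if it still scrambles data when fed
        # the predictable mock responses, HTTP::Async is not the culprit
    }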

    True laziness is hard work
      Okay, I tried to sanitize the code as much as possible. I was initially reluctant to do so, but I have to help y'all help me, and this is the best way. And yes, I know I'm a terrible coder:
      #!/usr/bin/perl
      use MIME::Base64;
      use Encode qw(encode);
      use DBI;
      use DBI qw(:sql_types);
      require HTTP::Request;
      require HTTP::Response;
      use HTTP::Async;

      # HTTP::Async timeout is broken by default. Check the cpan page for how
      # to fix it. It's in the bugs page.
      my $async = HTTP::Async->new( timeout => 5, slots => 100 );
      use List::MoreUtils;
      use strict;

      open PIDFILE, ">/usr/path/pidfile" or die $!;
      print PIDFILE $$;
      close PIDFILE;

      # definition of variables
      my $db       = "databasename";
      my $host     = "localhost";
      my $user     = "username";
      my $password = "password";
      my $verbose_logging = -1;

      while (1 == 1) {
          sleep rand 3;
          my @sites;
          my @db_row;
          my @response_array;

          # connect to MySQL database
          my $dbh = DBI->connect("DBI:mysql:database=$db:host=$host", $user, $password)
              or die "Can't connect to database: $DBI::errstr\n";

          # prepare the query
          my $sth = $dbh->prepare("SELECT `page_url`, `url2`, `idx`, `date_time`, `userid` FROM `queue` LIMIT 100");

          # execute the query
          $sth->execute();

          ## Retrieve the results of a row of data and print
          while ( my @row = $sth->fetchrow_array() ) {
              # set counter
              push(@sites, $row[0]);
              push @db_row, ([$row[0], $row[1], $row[2], $row[3], $row[4]]);
          }

          foreach my $site (@sites) {
              $async->add( HTTP::Request->new( GET => $site ) );
          }

          # Do some processing with $response
          #while ( my ( $response, $id ) = $async->wait_for_next_response ) {
          while ( $async->not_empty ) {
              if ( my ($response, $id) = $async->wait_for_next_response ) {
                  # deal with $response
                  print $async->info;
                  my $content = $response->decoded_content;    # Needs to be reencoded

                  # async assigns IDs for all the requests, but they begin with 1, whereas
                  # rows begin with 0 with sql requests. We'll need to decrement them to
                  # make sure everything matches up when we pick the elements out of the
                  # nested array
                  my $result_row = ($id - 1);

                  my $urlcount = 0;
                  # Counts the number of occurrences of the input in the html content
                  $urlcount++ while ($content =~ m/$db_row[$result_row][1]/gi);

                  # encodes the content so that we can easily store it to the DB for later referencing.
                  $content = encode_base64(encode("UTF-8", $content));

                  my $insert = "INSERT INTO cheker (`active`,`page_url`,`url2`,`date_time`,`userid`,`html_source`) VALUES ($urlcount,'$db_row[$result_row][0]','$db_row[$result_row][1]','$db_row[$result_row][3]','$db_row[$result_row][4]','$content');";
                  my $QueueExecute = $dbh->prepare($insert);
                  $QueueExecute->execute();
                  warn "Problem in retrieving results", $sth->errstr(), "\n"
                      if $QueueExecute->err();
                  if ($verbose_logging >= 0) {
                      print "Inserted record into checker\n";
                  }

                  my $delete = "DELETE FROM queue WHERE `idx` = '$db_row[$result_row][2]';";
                  my $QueueExecute = $dbh->prepare($delete);
                  $QueueExecute->execute();
                  if ($verbose_logging >= 0) {
                      print "Deleted row from queue\n";
                  }
                  warn "Problem in retrieving results", $sth->errstr(), "\n"
                      if $QueueExecute->err();
                  warn "Problem in retrieving results", $sth->errstr(), "\n"
                      if $sth->err();
              }
              else {
                  next;
              }
          }
          #$result_row++;
          # my $result_row = 0;
          # my $response_row = 1;
      }

        I changed the code to use SQLite (and fixed a few problems along the way, such as the fact that I could have injected SQL into your database if I knew you were crawling my site :)) and it works just fine here. What do you mean by "jumbled" contents, anyway?

        Here's what works for me:

        #!/usr/bin/perl
        use MIME::Base64;
        use Encode qw(encode);
        use DBI;
        use DBI qw(:sql_types);
        require HTTP::Request;
        require HTTP::Response;
        use HTTP::Async;

        # HTTP::Async timeout is broken by default. Check the cpan page for how
        # to fix it. It's in the bugs page.
        my $async = HTTP::Async->new( timeout => 60, slots => 100 );    # I'm on a terribly slow line
        use List::MoreUtils;
        use strict;

        open PIDFILE, ">$ENV{HOME}/pidfile" or die $!;    # will run as user
        print PIDFILE $$;
        close PIDFILE;

        # definition of variables
        my $db       = "databasename";
        my $host     = "localhost";
        my $user     = "username";
        my $password = "password";
        my $verbose_logging = -1;

        while (1 == 1) {
            sleep rand 3;
            my @sites;
            my @db_row;
            my @response_array;

            my $dbh = DBI->connect("DBI:SQLite:$db", '', '')
                or die "Can't connect to database: $DBI::errstr\n";

            my $sth = $dbh->prepare("SELECT `page_url`, `url2`, `idx`, `date_time`, `userid` FROM `queue` LIMIT 100");
            $sth->execute();

            while ( my @row = $sth->fetchrow_array ) {
                push @sites, $row[0];
                push @db_row, [$row[0], $row[1], $row[2], $row[3], $row[4]];
            }

            foreach my $site (@sites) {
                $async->add( HTTP::Request->new( GET => $site ) );
            }

            while ( $async->not_empty ) {
                if ( my ($response, $id) = $async->wait_for_next_response ) {
                    print $async->info;
                    my $content    = $response->decoded_content;
                    my $result_row = $id - 1;
                    my $urlcount   = 0;
                    $urlcount++ while ($content =~ m/$db_row[$result_row][1]/gi);
                    $content = encode_base64(encode("UTF-8", $content));

                    my $QueueExecute = $dbh->prepare(
                        "INSERT INTO cheker (`active`,`page_url`,`url2`,`date_time`,`userid`,`html_source`) VALUES (?,?,?,?,?,?);"
                    );
                    $QueueExecute->execute($urlcount, $db_row[$result_row][0], $db_row[$result_row][1],
                        $db_row[$result_row][3], $db_row[$result_row][4], $content);
                    warn "Problem in retrieving results", $sth->errstr(), "\n" if $QueueExecute->err;
                    print "Inserted record into checker\n" if ($verbose_logging >= 0);

                    my $QueueExecute = $dbh->prepare("DELETE FROM queue WHERE `idx` = ?");
                    $QueueExecute->execute($db_row[$result_row][2]);
                    print "Deleted row from queue\n" if ($verbose_logging >= 0);
                    warn "Problem in retrieving results", $sth->errstr, "\n" if $QueueExecute->err;
                    warn "Problem in retrieving results", $sth->errstr, "\n" if $sth->err;
                }
                else {
                    next;
                }
            }
        }

        SQLite stuff:

        $ sqlite3 databasename
        CREATE TABLE cheker (active tinyint, page_url varchar, url2 varchar, date_time datetime, userid integer, html_source text);
        CREATE TABLE queue(page_url varchar, url2 varchar, idx integer, date_time datetime, userid integer);
        insert into queue values ("http://google.com/","",1,date(),1);
        insert into queue values ("http://gmx.com/","",1,date(),1);
        insert into queue values ("http://twitter.com/","",1,date(),1);

        To test:

        sqlite3 -list databasename "select html_source from cheker"|base64 -d|less