EvanCarroll has asked for the wisdom of the Perl Monks concerning the following question:

I've been having a pretty awkward problem with a few POE modules. I have a script that does the following:

  1. Reads through a CSV getting URLs.
  2. Generates a SHA1 of the URL, sha1(url), plus some other minor logic for directories.
  3. If the sha1(url) exists it posts to the kernel a HEAD request for the url, else it posts a GET request.
  4. In the HEAD response handler: post a GET request to the kernel if the file needs to be re-downloaded.
  5. In the GET response handler, if there is data: write it to the file named sha1(url).
  6. If there is no more data, simply hard link sha1(url) to sha1(file) (this way the same URL can host different files at different times).

There is some other minor logic here, but this is just a basic parallel HTTP image downloader. The issue is that after a certain point I get one of these:

Cannot connect to imgs.getauto.com:80 (connect error 110: Connection timed out)

And then each subsequent request returns the same thing. No packets are sent out, as shown with tethereal. I've used 'netstat -atn' to establish that my sockets are opening and closing as they should. They do not get stuck in FIN_WAIT2 (as in the other POCO::Client::HTTP bug).

Here is a dump of the request and response after I get bogged down in this endless loop of nothing:

- &1 !!perl/hash:HTTP::Request
  _content: ''
  _headers: !!perl/hash:HTTP::Headers
    accept: image/*
    from: evan@dealermade.com
    host: imgs.getauto.com
    user-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.3) Gecko/2008101315 Ubuntu/8.10 (intrepid) Firefox/3.0.3
  _method: GET
  _protocol: HTTP/1.1
  _uri: !!perl/scalar:URI::http http://imgs.getauto.com/imgs/ag/ga/62/90/1/WDDNG71X47A036290-1.jpg
- !!perl/hash:HTTP::Response
  _content: |
    <html>
    <HEAD><TITLE>Error: Internal Server Error</TITLE></HEAD>
    <BODY>
    <H1>Error: Internal Server Error</H1>
    Cannot connect to imgs.getauto.com:80 (connect error 110: Connection timed out)
    </BODY>
    </HTML>
  _headers: !!perl/hash:HTTP::Headers {}
  _msg: ~
  _rc: 500
  _request: *1

According to dngor on irc.perl.org (the module's author), that response, including the HTTP, is forged by the HTTP::Response package, which actually comes close to making my blood boil.

I've even tried to use the Perl debugger, to no avail. I set the NoTTY option and then set signal=1, and the whole thing crashes. The debugger does not seem to be POE-friendly. I'm totally at a loss. The versions of the modules I'm using are as follows:

POE::Component::Client::Keepalive v0.23
POE::Component::Client::HTTP v0.86

#!/usr/bin/env perl
BEGIN{ $DB::signal=0; }
use strict;
use warnings;

use Fcntl;
use Digest::SHA1 qw();
use IO::File;
use Text::CSV;
use File::Spec qw();
use File::Basename qw();
use File::Path qw();
use File::stat;
use Memoize;
memoize( 'generate_tempname' );

use constant VERBOSE => 1;
use feature ':5.10';

use HTTP::Request::Common qw(GET POST HEAD);

sub POE::Kernel::ASSERT_DEFAULT () { 1 }
use POE qw(Component::Client::HTTP); # Component::Client::Keepalive);

#my $pool = POE::Component::Client::Keepalive->new( max_per_host => 4, timeout => 1800, keep_alive => 180 );

POE::Component::Client::HTTP->spawn(
    Alias => 'dmua'
    , Streaming => 4096
#   , ConnectionManager => $pool
    , FollowRedirects => 2
    , Agent => 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.3) Gecko/2008101315 Ubuntu/8.10 (intrepid) Firefox/3.0.3'
    , From => 'evan@dealermade.com'
);

POE::Session->create(
    inline_states => {
        _start => \&client_start
        , _stop => \&client_stop
        , got_response => \&client_got_response
        , transfer_complete => \&finalize_transfer
    }
);

$poe_kernel->run();

### Event handlers begin here.

sub client_start {
    my ($kernel, $heap) = @_[KERNEL, HEAP];
    ## $poe_kernel->sig(INT => "_stop");

    my $fh = IO::File->new( 'dealermade_pictures.csv', 'r' );
    my $header = $fh->getline;
    my $csv = Text::CSV->new;

    while ( my $line = $fh->getline ) {
        $csv->parse( $line );
        my ( $picid, $url, $lot, $is_primary ) = $csv->fields;
        my $temp = generate_tempname( $url );

        if ( -e $temp && -f $temp && -s $temp ) {
            $kernel->post( dmua => request => got_response => HEAD( $url ) );
        }
        else {
            $kernel->post( dmua => request => got_response => GET( $url, Accept => 'image/*' ) );
        }
    }
}

sub client_stop {
    my $heap = $_[HEAP];
}

sub client_got_response {
    my ($heap, $req, $res, $data ) = ( $_[HEAP], $_[ARG0]->[0], @{$_[ARG1]} );
    my $uri = $req->uri;
    my $temp = generate_tempname( $uri );

    given ( $req->method ) {

        when ( 'HEAD' ) {
            if ( -e $temp && -f $temp ) {
                my $stat = stat( $temp );
                my $badSize = $res->content_length && $stat->size != $res->content_length;
                my $badDate = $stat->mtime - $res->fresh_until > 0;

                ## My slow sledge hammer
                ## use DateTime qw();
                ## my $badDate = DateTime->from_epoch( epoch => $stat->mtime )
                ##     ->subtract_datetime( DateTime->from_epoch( epoch => $res->fresh_until ) )
                ##     ->is_positive
                ## ;

                if ( VERBOSE ) {
                    if ( $badDate || $badSize ) {
                        say "Posting to the kernel a request to redownload $uri";
                        say "\tBAD SIZE detected, our file is ". $stat->size ." and it should be ". $res->content_length
                            if $badSize
                        ;
                        say "\tBAD DATE detected -- file has since been modified"
                            if $badDate;
                    }
                    else {
                        say "Skipping $uri -- all is current";
                    }
                }

                $poe_kernel->post( dmua => request => got_response => GET( $uri, Accept => 'image/*' ) )
                    if $badSize || $badDate
                ;
            }
            else {
                warn "HEAD requested on non-cached file $temp\n";
            }
        }

        when ( 'GET' ) {
            my $this = $_[HEAP]->{uri}{$uri};
            my $fh = $this->{fh};

            if ( !defined($res->code) || $res->code != '200' ) {
                say $res->code . " was received from request to $uri";
                if ( $res->code == 500 ) {
                    use XXX;
                    YYY [ $req, $res, $_[HEAP]->{connection} ];
                    $DB::signal=1;
                }
                return;
            }

            ## If we've never encountered a response from this request.
            unless ( $fh ) {
                if ( VERBOSE ) {
                    say "Started download of $uri : " . $res->code;
                    say "\tDestination temp name:\t$temp";
                }

                ## If the file exists simply unlink it and start over.
                ## This is here so we can refresh the data behind the url
                if ( -e $temp && -f $temp ) {
                    say "\tUnlinking preexiting uri first" if VERBOSE;
                    unlink ( $temp );
                }
                ## Else we might have to create the path to it.
                else {
                    my $path = File::Basename::dirname( $temp );
                    unless ( -d $path and -e $path ) {
                        File::Path::mkpath( $path );
                        say "\tCreating path:\t$path";
                    }
                }

                sysopen ( $fh , $temp , O_WRONLY|O_CREAT );
                binmode($fh); ## win32 not required in linux

                $this = { fh => $fh, temp => $temp, uri => $uri };
                $_[HEAP]->{uri}{$uri} = $this;
            }

            ##
            ## If we have data send it to our file handle
            ##
            if ( defined $data ) {
                print $fh $data;
            }
            ##
            ## If we have no more data hard link to store and remove
            ##
            else {
                close $fh;
                my $stor = generate_storename( $uri );
                say "Linking $temp to $stor" if VERBOSE;
                my $path = File::Basename::dirname( $stor );
                File::Path::mkpath( $path ) unless -e $path && -d $path;
                CORE::link( $temp, $stor )
                    unless -e $stor
                ;
                delete $heap->{uri}{$this->{uri}};
            }
        }
    }
}

sub generate_tempname {
    my $uri = shift;
    my $sha1 = Digest::SHA1::sha1_hex( $uri );
    my ( $f1, $f2, $file ) = unpack ( 'A2A2A*', $sha1 );
    ## List-context match: $ext stays undef when the URL has no extension
    my ($ext) = $uri =~ /.*([.].*?)$/;
    ## Parenthesized so that '.jpg' is the fallback for $ext only
    File::Spec->catfile( qw/out temp/, $f1, $f2, $file . ( $ext || '.jpg' ) );
}

sub generate_storename {
    my $uri = shift;
    my $tempname = generate_tempname($uri);
    my $io = IO::File->new( $tempname, 'r' );
    my $sha1 = Digest::SHA1->new;
    $sha1->addfile($io);
    $io->close;
    my ( $f1, $f2, $file ) = unpack ( 'A2A2A*', $sha1->hexdigest );
    my ($ext) = $uri =~ /.*([.].*?)$/;
    #File::Spec->catfile( qw/out store/, $sha1->hexdigest . $ext );
    File::Spec->catfile( qw/out store/, $f1, $f2, $file . ( $ext || '.jpg' ) );
}

This is what strace returns after a certain point in time. Notice that it doesn't check sockets or do anything complex:

write(1, "---\n- &1 !!perl/hash:HTTP::Reque"..., 808) = 808
write(1, "500 was received from request to"..., 100) = 100
write(1, "---\n- &1 !!perl/hash:HTTP::Reque"..., 808) = 808
write(1, "500 was received from request to"..., 100) = 100
write(1, "---\n- &1 !!perl/hash:HTTP::Reque"..., 808) = 808
write(1, "500 was received from request to"..., 100) = 100
... forever

Here is the output from POCO::Client::HTTP with the DEBUG and DEBUG_DATA variables set:

T/O: request 149 timed out at /usr/local/lib/perl5/site_perl/5.10.0/POE/Component/Client/HTTP.pm line 377.
I/O: removing request 149 at /usr/local/lib/perl5/site_perl/5.10.0/POE/Component/Client/HTTP.pm line 380.
T/O: request 149 has timer 8948 at /usr/local/lib/perl5/site_perl/5.10.0/POE/Component/Client/HTTP.pm line 391.
T/O: request 149 is wheel 153 at /usr/local/lib/perl5/site_perl/5.10.0/POE/Component/Client/HTTP.pm line 397.
T/O: request_state = 0x04
I/O: Disconnect, keepalive timeout or HTTP/1.0. at /usr/local/lib/perl5/site_perl/5.10.0/POE/Component/Client/HTTP.pm line 421.

I don't know enough about what this stuff means, but this is the only suspicious pattern I see repeating in strace. My guess: it tries to set up a socket with some deep voodoo (failing), then seeks on it (also failing), then tries to do it all again. Then it assumes the socket is open, fails, and never reconnects.

socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
ioctl(4, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbfb8b5d8) = -1 EINVAL (Invalid argument)
_llseek(4, 0, 0xbfb8b600, SEEK_CUR) = -1 ESPIPE (Illegal seek)
ioctl(4, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbfb8b5d8) = -1 EINVAL (Invalid argument)
_llseek(4, 0, 0xbfb8b600, SEEK_CUR) = -1 ESPIPE (Illegal seek)
fcntl64(4, F_SETFD, FD_CLOEXEC) = 0
getpeername(4, 0x1d08a4e8, [256]) = -1 ENOTCONN (Transport endpoint is not connected)

It should still fail with the KeepAlive stuff commented out (as it is above), though it will take longer to get to the failure point.

DATA FILE
The data file can be found at http://dealermade.com/dealermade_pictures.csv


Evan Carroll
I hack for the ladies.
www.EvanCarroll.com

Re: POE Component HTTP problem
by bingos (Vicar) on Nov 28, 2008 at 09:36 UTC

    I finally got around to having a look at this code, and the first thing that jumped out and screamed at me was that you are blocking in client_start whilst reading and parsing the CSV input file. What is happening is that the while loop blocks the kernel whilst you are enqueuing the requests to PoCo-Client-HTTP. These requests sit enqueued until the while loop ends and client_start finishes, and then they are executed in one big whoosh.

    Since you appear to be using strace to diagnose the problem, I guess you are on Linux, so you can use POE::Wheel::ReadWrite and POE filters to parse your input file and be a lot more co-operative.

    Only amendments shown:
    use POE qw(Wheel::ReadWrite Filter::Stackable Filter::Line Filter::CSV);

    # add extra handlers for "file_input" "file_error"

    POE::Session->create(
        inline_states => {
            _start => \&client_start
            , _stop => \&client_stop
            , got_response => \&client_got_response
            , transfer_complete => \&finalize_transfer
            , file_input => \&file_input
            , file_error => \&file_error
        }
    );

    sub client_start {
        my ($kernel, $heap) = @_[KERNEL, HEAP];
        ## $poe_kernel->sig(INT => "_stop");

        my $fh = IO::File->new( 'dealermade_pictures.csv', 'r' );
        $heap->{file} = POE::Wheel::ReadWrite->new(
            Handle => $fh,
            Filter => POE::Filter::Stackable->new(
                Filters => [
                    POE::Filter::Line->new(),
                    POE::Filter::CSV->new(),
                ],
            ),
            InputEvent => 'file_input',
            ErrorEvent => 'file_error',
        );
        $heap->{_header} = 0;
        return;
    }

    sub file_input {
        my ($kernel,$heap,$fields) = @_[KERNEL,HEAP,ARG0];
        unless ( $heap->{_header} ) {
            $heap->{_header}++;
            return;
        }
        my ( $picid, $url, $lot, $is_primary ) = @$fields;
        my $temp = generate_tempname( $url );
        if ( -e $temp && -f $temp && -s $temp ) {
            $kernel->post( dmua => request => got_response => HEAD( $url ) );
        }
        else {
            $kernel->post( dmua => request => got_response => GET( $url, Accept => 'image/*' ) );
        }
        return;
    }

    sub file_error {
        my ($kernel,$heap) = @_[KERNEL,HEAP];
        delete $heap->{file};
        return;
    }

    The reading and parsing of the input file is now asynchronous and HTTP requests are enqueued and happening asynchronously whilst the input file is being read.

Re: POE Component HTTP problem
by waba (Monk) on Nov 27, 2008 at 21:30 UTC

    I don't see anything limiting the number of concurrent connections in your code or in PoCo::Client::HTTP's documentation. Is there something that I missed?

    Trying to run your code here with a few debug statements added, the 77802 URLs are stuffed into PoCo::Client::HTTP and it consumes all the CPU. This behaviour alone seems a good reason to limit the number of sub-tasks running at once.

    Could you try to refactor your code to process N URLs at once only (10?) and see if it happens again?
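
One way to process only N URLs at once is to hold the remaining URLs in a queue and start a new request only when an earlier one finishes. The following is a minimal sketch of that idea in plain Perl, not code from the thread: `make_throttle`, its `enqueue`/`done` keys, and the example URLs are all hypothetical names, and the dispatch callback stands in for whatever actually posts the request (in the script above, presumably a `$kernel->post( dmua => request => ... )` call).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of POE): keeps at most $max requests in flight.
# $dispatch is called once per URL as soon as a slot is free.
sub make_throttle {
    my ( $max, $dispatch ) = @_;
    my @pending;          # URLs waiting for a free slot
    my $in_flight = 0;    # requests started but not yet finished

    # Start queued URLs until the limit is reached or the queue is empty.
    my $pump = sub {
        while ( $in_flight < $max && @pending ) {
            $in_flight++;
            $dispatch->( shift @pending );
        }
    };

    return {
        enqueue   => sub { push @pending, @_; $pump->() },
        done      => sub { $in_flight-- if $in_flight > 0; $pump->() },
        in_flight => sub { $in_flight },
        pending   => sub { scalar @pending },
    };
}

# Usage: the callback only records the URL here; a real one would post the request.
my @started;
my $throttle = make_throttle( 10, sub { push @started, shift } );

# Made-up URLs for illustration: 10 start immediately, 15 wait in the queue.
$throttle->{enqueue}->( map { "http://example.invalid/img$_.jpg" } 1 .. 25 );

# Called when one response has completed (or failed): frees a slot, starts the next URL.
$throttle->{done}->();
```

In the streaming setup used in the question, `done` would presumably be called from the response handler when the final chunk arrives (the branch where `$data` is undefined) and on error responses, so that a slot is released exactly once per request.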

Re: POE Component HTTP problem
by EvanCarroll (Chaplain) on Nov 30, 2008 at 01:54 UTC
    Yes, the issue is that I've confused HTTP_Timeout with $POE_MODULE Timeout. I can't think of why anyone would want a random timeout on a module's internals, when arguably all of the factors are under your control. The issue is that I wanted a timeout in POCO::Client::Keepalive, not in POCO::Client::HTTP. I didn't realize I could enqueue so many requests that they would time out before they even began transmission. Anyway, I have a doc patch that will prevent the confusion and a code patch that will prevent POCO::Client::Keepalive from having a timeout greater than the timeout in POCO::Client::HTTP.
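
To see why queued requests can expire before transmission starts, a rough back-of-the-envelope calculation helps. The sketch below assumes the timeout clock starts when a request is enqueued, as described above. Apart from the URL count reported earlier in the thread, every number in it is a made-up illustration, not a measurement or a documented default.

```perl
use strict;
use warnings;

# All values except $urls are illustrative assumptions.
my $urls         = 77_802;   # URLs enqueued at once (count reported in the thread)
my $timeout      = 180;      # seconds a request may live, counted from enqueue (assumed)
my $parallel     = 4;        # requests actually on the wire at once (assumed)
my $secs_per_req = 0.5;      # average time to complete one request (assumed)

# Requests completed per second, and how many can finish before the timeout.
my $rate   = $parallel / $secs_per_req;
my $served = $rate * $timeout;
$served = $urls if $served > $urls;

# Everything still waiting in the queue at that point expires unsent.
my $expired = $urls - $served;

printf "%d requests served, %d time out while still queued\n", $served, $expired;
# prints "1440 requests served, 76362 time out while still queued"
```

Under these assumptions only a small fraction of the queue is ever transmitted, which is consistent with the pattern in the question: after a certain point, every remaining request comes back as a timeout error without any packets being sent.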


    Evan Carroll
    I hack for the ladies.
    www.EvanCarroll.com