tachyon has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have an interesting issue. The code that follows has been in testing for a while and originally used read() to read data from a socket connection to a web server. The URL noted led to an interesting problem, with the read hanging at 97%. Changing to sysread() solved the problem. This is the first URL to display this behaviour.

Questions?

  1. Why is the read broken? Broken on both Win32 and Linux for this particular download (only currently known test case but no doubt there are many others). It appears likely that the missing data is simply being buffered somewhere.
  2. The server on the end of the socket claims to be IIS/5.0
  3. In the docs there is a note that sysread() bypasses stdio, so mixing this with other kinds of reads, print, write, seek, tell, or eof can cause confusion because stdio usually buffers data.
  4. Does this mean I should change print $fh to syswrite()? Internally LWP::Simple happily intermixes print and sysread....

Why might this be important?

Both CGI and CGI::Simple use read() to get POST data. There is a transient issue, difficult to prove with a reliable test case, with both modules and some browsers in certain circumstances: with large POSTs (not multipart form), the read call sometimes fails to return all the expected data. read() is blocking and should get all the data you asked for (if sent). The threshold for large appears to be ~20K ? 16384. You can kill read with signals due to the non-reentrant behaviour of the old C libs, but getting it to return a short read any other way in a test case has proved problematic. Wisdom appreciated.

#!/usr/bin/perl -w
use strict;
use IO::Socket::INET;
$|++;

my $url      = "http://ftp.blizzard.com/pub/war3/maps/(4)iceforge.zip";
my $filename = 'c:/tmp.zip';    # declared so the script compiles under strict
my $DEBUG    = 1;
my $CRLF     = "\015\012\015\012";

my ( $code, $type, $length, $sock, $data_buffer, $location ) = init_download( $url );
open my $fh, '>', $filename or die $!;
binmode $fh;
print $fh $data_buffer;
download( $fh, $sock, $filename, length($data_buffer), $length );

sub download {
    my ( $fh, $sock, $filename, $got_so_far, $length ) = @_;
    my $buffer;
    print "Got: $got_so_far\n" if $DEBUG;
    #
    # This hangs with read() but works with sysread()
    #
    while ( ($got_so_far < $length) and sysread( $sock, $buffer, 8192 ) ){
        print $fh $buffer;
        $got_so_far += length $buffer;
        print "Got: $got_so_far\n" if $DEBUG;
        #write_lockfile( $filename, $got_so_far );
    }
    close $fh;
    $sock->close;
    print "Wanted: $length\nGot $got_so_far\n";
    unless ( $length == $got_so_far ) {
        die "Expected $length bytes but only got $got_so_far" ;
    }
}

sub init_download {
    my ( $url ) = @_;
    ui_network_error( "Invalid URL $url\n" )
        unless $url =~ m!^http://([^/:\@]+)(?::(\d+))?(/\S*)?$!;
    my $host = $1;
    my $port = $2 || 80;
    my $path = $3;
    $path = "/" unless defined $path;
    my $sock = IO::Socket::INET->new( PeerAddr => $host, Proto => 'tcp', PeerPort => $port )
        or ui_network_error( 'Could not connect socket', $url );
    $sock->autoflush;
    print $sock "GET $url HTTP/1.0
Host: localhost
Accept: */*
Connection: Keep-Alive
User-Agent: Mozilla/4.0 (compatible; MSIE 4.5; Windows 98; )
$CRLF";
    my ($header, $content, $buffer);
    while (sysread( $sock, $buffer, 8192 )){
        $content .= $buffer;
        # stop once the blank line that ends the header has arrived
        if ( (my $index = (index $content, $CRLF)) > 0 ) {
            $header  = substr $content, 0, $index;
            $content = substr $content, $index+ 4;
            last;
        }
    }
    $header =~ s/\015\012/\n/g;
    # unfold the header
    $header =~ s/\n\s+/ /g;
    my ($length) = $header =~ m/^Content-Length:\s*(\d+)/im;
    my ($type)   = $header =~ m/^Content-Type:\s*([^\r\n]+)/im;
    my ($loc)    = $header =~ m/^Location:\s*([^\r\n]+)/im;
    my ($code)   = $header =~ m!^HTTP/\d\.\d[^\d]+(\d+)!i;
    print "$header\n----\nWant: $length\n";
    return ( $code, $type, $length, $sock, $content, $loc )
}

sub ui_network_error{ die shift }

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Replies are listed 'Best First'.
Re: read aka fread(3) broken, sysread aka read(2) works IIS socket
by dws (Chancellor) on Sep 11, 2003 at 07:13 UTC
    antirice provides a valuable observation. The underlying issue is this: read() buffers; sysread() doesn't. This matters when the socket is still open, as it will be when you use the Connection: Keep-Alive header. This tells the web server that you expect to reuse the socket for additional requests. The web server obliges by keeping the socket open and by sending a Content-Length header in the response to tell you exactly how many bytes it is safe for you to read. If you try to read more, you'll block.

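    read(), given a number of bytes to read, is buffered: it keeps waiting until that many bytes have arrived (or it hits EOF or an error), so it blocks if the server has nothing more to send but keeps the socket open. sysread() returns as soon as any data is available, with however many bytes that is, up to the length requested.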
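
The difference can be seen without a web server. The following is a minimal sketch, assuming a pipe is an acceptable stand-in for the socket; the pipe handles and payload sizes are made up for illustration and are not part of the original script.

```perl
use strict;
use warnings;

# First pipe: the write end stays open, like a Keep-Alive connection
# after the response body has been sent.
pipe( my $open_reader, my $open_writer ) or die "pipe failed: $!";
binmode $_ for $open_reader, $open_writer;
syswrite( $open_writer, 'x' x 10 ) or die "write failed: $!";

# sysread() asks for up to 8192 bytes but returns as soon as data is there.
my $got = sysread( $open_reader, my $buffer, 8192 );
die "sysread failed: $!" unless defined $got;
print "sysread returned $got bytes\n";

# A buffered read( $open_reader, $buffer, 8192 ) at this point would block:
# the writer is still open and the remaining bytes never arrive.

# Second pipe: the write end is closed, like a connection without Keep-Alive.
pipe( my $closed_reader, my $closed_writer ) or die "pipe failed: $!";
binmode $_ for $closed_reader, $closed_writer;
syswrite( $closed_writer, 'y' x 5 ) or die "write failed: $!";
close $closed_writer;

# read() now sees EOF after 5 bytes and returns instead of waiting.
my $rest = read( $closed_reader, my $tail, 8192 );
die "read failed: $!" unless defined $rest;
print "read returned $rest bytes at EOF\n";
```

The second half is presumably why dropping Keep-Alive makes the original read() loop work: the server closes the socket, so read() hits EOF rather than waiting for a full buffer.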

    Alternatively, you could use the Content-Length that the web server has returned to you (and which you've squirreled away in $length), and read() only that many bytes.
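
As a rough sketch of that approach, the helper below loops until exactly the expected number of bytes has arrived, never asking for more than remains. The name read_exactly and the pipe used in the demo are hypothetical stand-ins, not code from the thread; it uses sysread(), but the same capping of the request size is what makes read() safe too.

```perl
use strict;
use warnings;

# Hypothetical helper: read exactly $want bytes from $fh, looping over
# short reads. The result is shorter than $want only if EOF comes first.
sub read_exactly {
    my ( $fh, $want ) = @_;
    my $data = '';
    while ( length($data) < $want ) {
        my $remaining = $want - length $data;
        my $chunk     = $remaining < 8192 ? $remaining : 8192;
        # 4-argument sysread appends at the given offset in $data
        my $n = sysread( $fh, $data, $chunk, length $data );
        die "sysread failed: $!" unless defined $n;
        last if $n == 0;    # EOF before the expected length
    }
    return $data;
}

# Demo on a pipe left open, standing in for a Keep-Alive socket that
# already holds the start of the next response.
pipe( my $reader, my $writer ) or die "pipe failed: $!";
binmode $_ for $reader, $writer;
syswrite( $writer, ( 'a' x 3000 ) . 'NEXT RESPONSE' ) or die "write failed: $!";

my $body = read_exactly( $reader, 3000 );
print "Got ", length($body), " bytes\n";
```

Because the request size is capped at the bytes still owed, the loop stops at the end of the body and leaves any following data on the socket untouched.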

      Good point ++dws. I tried stepping by 1 (since 130171 is prime :-/) and it worked. I came back after doing a little work solving a stupid math error on my part with this code:

      ...
      my $readamount = int(($length - $got_so_far)/8192)?8192:$length-$got_so_far;
      while ( ($got_so_far < $length) and read( $sock, $buffer, $readamount ) ){
          print $fh $buffer;
          $got_so_far += length $buffer;
          $readamount = int(($length - $got_so_far)/8192)?8192:$length-$got_so_far;
          print "Got: $got_so_far\n" if $DEBUG;
      }
      ...

      and you had already made the points I was going to make. However, to the OP: Hope this helps.

      Update: Thanks to dws for pointing out a possible block situation. Note to the OP that the read under init_download should probably choose a smaller chunk to be read. Of course, you could also just use sysread. :)

      antirice    
      The first rule of Perl club is - use Perl
      The ith rule of Perl club is - follow rule i - 1 for i > 1

Re: read aka fread(3) broken, sysread aka read(2) works IIS socket
by antirice (Priest) on Sep 11, 2003 at 04:36 UTC

    Dunno. However, if you take out "Connection: Keep-Alive" from the header, it works as expected.

    Hope this helps.

    antirice    
    The first rule of Perl club is - use Perl
    The ith rule of Perl club is - follow rule i - 1 for i > 1

Re: read aka fread(3) broken, sysread aka read(2) works IIS socket
by dws (Chancellor) on Sep 11, 2003 at 08:32 UTC
    On a related note (related by code, not by problem): Using HTTP/1.0 will prevent your code from working against a virtually hosted site. (Virtual hosting allows multiple domains to share an IP address. Making that work requires passing a domain name in the HTTP request.) Switching to HTTP/1.1 should be very, very easy. Try changing
    print $sock "GET $url HTTP/1.0
    Host: localhost
    to
    print $sock "GET $url HTTP/1.1
    Host: $host
    That should do it.

      Using the "Host:" header with HTTP/1.0 is acceptable. Web servers that support virtual hosts almost always support it. Ancient web servers that don't support virtual hosts will usually just ignore the header. It is safest to always include the "Host" header with HTTP/1.0 requests.

      More importantly, the HTTP/1.1 protocol places requirements on clients that hand-rolled scripts usually don't meet. For example, HTTP/1.1 clients must support chunked encoding and must handle persistent connections gracefully. Clients that don't want a persistent connection should send a "Connection: close" header.
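
Putting those two points together, a request for a hand-rolled client might be built as in the sketch below. The helper name build_request, the host www.example.com and the path are placeholders chosen for illustration, and sending the path rather than the full URL on the request line is an assumption that differs from the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: build an HTTP/1.0 GET request with explicit CRLF
# line endings, a Host header, and "Connection: close".
sub build_request {
    my ( $host, $path ) = @_;
    my $crlf = "\015\012";
    return join $crlf,
        "GET $path HTTP/1.0",
        "Host: $host",
        "Accept: */*",
        "Connection: close",
        '', '';    # the two empty items yield the blank line ending the headers
}

my $request = build_request( 'www.example.com', '/index.html' );
print $request;
```

With "Connection: close" the server should close the socket after the body, so a plain read-until-EOF loop terminates even without consulting Content-Length.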