in reply to How to download JUST the first X bytes of HTTP response ?

I do this exact task myself with an image metadata crawler I wrote. The steps are:

1) Connect to the network socket (S)

2) Turn off buffering on the socket with select(S); $| = 1; select(STDOUT);

3) Send request to server:

$result = eval 'print S "GET /$document HTTP/1.1\nHost: $server_host\n\n"';

4) Read <S> up to the end of the HTTP header (and verify that you have the expected Content-type).

5) Call Image::ExifTool::ImageInfo(\*S, {FastScan => 1}) to read the metadata.

ExifTool will only read as much of the file as necessary to obtain the metadata (the FastScan option prevents reading to the end of the image to look for a metadata trailer).
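As a minimal sketch of step 4, the header-reading loop can be written like this. The `read_content_type` helper and the canned response are illustrative (not taken from my crawler), and an in-memory filehandle stands in for the socket so the parsing logic can be tried without a network connection:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read lines from a handle up to the blank line that ends the HTTP
# header, and return the Content-Type (or undef if none was found).
sub read_content_type {
    my $fh = shift;
    my $header_text = '';
    while (my $line = <$fh>) {
        last if $line =~ /^[\r\n]*$/;   # blank line ends the header
        $header_text .= $line;
    }
    my ($type) = $header_text =~ /^Content-type:\s*([^\s;]+)/mi;
    return $type;
}

# Simulate a server response with an in-memory filehandle.
my $response = "HTTP/1.1 200 OK\r\n"
             . "Content-Type: image/jpeg\r\n"
             . "Content-Length: 20971520\r\n"
             . "\r\n"
             . "(binary image data would follow here)";
open my $fh, '<', \$response or die $!;
my $type = read_content_type($fh);
print "Content-Type: $type\n";   # prints "Content-Type: image/jpeg"
```

After this check succeeds, the same handle is positioned at the start of the body, which is the point where ExifTool takes over reading.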

- Phil

Replies are listed 'Best First'.
Re^2: How to download JUST the first X bytes of HTTP response ?
by fx (Pilgrim) on Dec 09, 2010 at 15:12 UTC

    I'm not 100% convinced by this... but happy to be proved wrong! :)

    I had a little go with some code based on these ideas and found that talking "raw" to the remote web server meant I ended up downloading the entire file regardless of how much of it I actually read (confirmed using Wireshark to track the actual inbound data). This, to my mind, wouldn't save the OP any bandwidth, as the whole file is coming down the line anyway.

    As I said, happy to be proved wrong!

    fx, Infinity is Colourless

      You're determined to make me do some real work, aren't you? :P

      As a test, I hacked my crawler to download a single 20MB JPEG image from a remote http server and used tcpdump to view the network traffic. I tested this with and without the ExifTool FastScan option, and repeated the test a few times to make sure I wasn't being fooled by other network traffic. I consistently saw about 50 packets transferred with FastScan set, and about 37000 packets without FastScan.

      So I conclude that this definitely works for me. I can't say what the difference is for you. I ran my tests on Mac OS X 10.4.11 with Perl 5.8.6.

      - Phil
      I prepared a test script which works for me, and transfers only as much of the image as necessary to extract the metadata:
      #! /usr/bin/perl -w
      #
      # File:        url_info
      #
      # Author:      Phil Harvey
      #
      # Syntax:      url_info URL
      #
      # Description: test to get image info from image on the net
      #
      # Example:     url_info http://owl.phy.queensu.ca/~phil/big.jpg
      #
      # References:  Based on web crawler script:
      #              http://www.linuxjournal.com/files/linuxjournal.com/linuxjournal/articles/022/2200/2200l1.html
      #
      use strict;
      use Image::ExifTool;

      sub url_info($);

      my $DEBUG = 0;  # set to 1 for debugging network stuff

      my $url = shift or die "Syntax: url_info URL\n";

      my $exifTool = new Image::ExifTool;
      $exifTool->Options(FastScan => 1);

      my $info = url_info($url);
      die "No image info for $url\n" unless $info;
      foreach (sort keys %$info) {
          print "$_ => $$info{$_}\n";
      }
      exit 1;

      #---------------------------------------------------------------------
      # Get the page at specified URL
      # Inputs: 0) URL
      # Returns: 0)
      #---------------------------------------------------------------------
      sub url_info($)
      {
          my $url = shift;
          my ($protocol, $host, $port, $document) =
              $url =~ m|^([^:/]*)://([^:/]*):*([0-9]*)/*([^:]*)$|;

          # Some constants used to access the TCP network.
          my $AF_INET = 2;
          my $SOCK_STREAM = 1;

          # Use default http port if none specified.
          $port = 80 unless $port;

          # Get the protocol number for TCP.
          my ($name, $aliases, $proto) = getprotobyname("tcp");

          # Get the IP addresses for the two hosts.
          my ($type, $len, $thataddr);
          ($name, $aliases, $type, $len, $thataddr) = gethostbyname($host);

          # Check we could resolve the server host name.
          return undef unless defined $thataddr;
          my ($a, $b, $c, $d) = unpack('C4', $thataddr);
          if (not defined $d or ($a eq '' && $b eq '' && $c eq '' && $d eq '')) {
              warn "Unknown host $host.\n";
              return undef;
          }
          print "Server: $host ($a.$b.$c.$d)\n" if $DEBUG;

          # Pack the AF_INET magic number, the port, and the (already
          # packed) IP addresses into the same format as the C structure
          # would use. Note this is architecture dependent: this pack format
          # works for 32 bit architectures.
          my $that = pack("S n a4 x8", $AF_INET, $port, $thataddr);

          # Create the socket and connect.
          unless (socket(S, $AF_INET, $SOCK_STREAM, $proto)) {
              warn "Cannot create socket.\n";
              alarm 0;
              return undef;
          }
          print "Socket OK\n" if $DEBUG;

          local $SIG{ALRM} = sub { die "ALARM\n" };
          alarm 3600;     # set timeout of 1 hour

          my $result = eval 'connect(S, $that)';
          if ($@ or not $result) {
              warn "Cannot connect to server $host, port $port.\n";
              alarm 0;
              return undef;
          }
          print "Connect OK\n" if $DEBUG;

          # Turn buffering in the socket off, and send request to the server
          select(S); $| = 1; select(STDOUT);
          $result = eval 'print S "GET /$document HTTP/1.1\nHost: $host\n\n"';
          if ($@ or not $result) {
              warn "Timeout when sending to $host, port $port.\n";
              alarm 0;
              return undef;
          }

          # Receive the response. Check to ensure the response is of MIME
          # type text/html or text/plain.
          my $header = 1;
          my $header_text = "";
          for (;;) {
              $_ = eval '<S>';
              if ($@) {
                  warn "Timeout when reading from $host, port $port\n";
                  last;
              }
              last unless defined $_;
              # Check if we've hit the end of the HTTP header (empty line).
              # If we have, check for a content-type header line, and ensure
              # it is valid.
              if (m|^[\n\r]*$|) {
                  $header = 0;
                  my ($content) = $header_text =~ /Content-type: ([^\s;]+)/i;
                  if ($content and $content =~ m{^image/(.*)}) {
                      # extract image metadata
                      my $raf = new File::RandomAccess(\*S);
                      $info = $exifTool->ImageInfo($raf);
                  } else {
                      warn "Not an image\n";
                  }
                  last;   # all done
              } elsif ($header == 1) {
                  # Save to a header string if we're still working on the
                  # HTTP header.
                  $header_text .= " " . $_;
              }
          }
          eval 'close S';
          alarm 0;
          print "HTTP header: \n $header_text" if $DEBUG;
          return $info;
      }
      - Phil