Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I am working on a system with a large number of JPEG images stored online (and accessible only via HTTP - no local access). Images change frequently, and appear and disappear very regularly.

I would like to catalogue the various makes and models of cameras used to create these images, and report on various other stats relating to the metadata.

I've got something working now which downloads the images, uses Image::ExifTool to get at the metadata and all works - brilliant!

However, trawling through these images is a massive drain on bandwidth (as I'm downloading every image in full), and as I'm contracting and working from home I'm starting to hit my bandwidth allowance and need to streamline my downloading.

I've got a new script working locally on a cache of images that only needs the first few bytes of the file (enough to grab the portion containing the EXIF metadata) in order for me to get at the data I want. This works fine and will be a massive saving on network bandwidth when the images can be several megs and I only need 1K at most.
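
Roughly, the local version does something like this (the cache path and the 1K figure are just for illustration):

    #!/usr/bin/perl
    # Illustrative sketch only: read just the start of a cached JPEG and hand
    # the bytes to Image::ExifTool as a scalar reference.
    use strict;
    use warnings;
    use Image::ExifTool qw(ImageInfo);

    my $file = 'cache/example.jpg';      # illustrative path
    open my $fh, '<:raw', $file or die "$file: $!";
    read $fh, my $buf, 1024;             # first 1K is usually enough
    close $fh;

    my $info = ImageInfo(\$buf);         # ExifTool accepts a scalar reference
    print "$_ => $$info{$_}\n" for sort keys %$info;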

So, now is the time to make my new script work over the network, but I'm stuck... how do I download just the first 1K (or whatever) of a file over HTTP? Can I do this with the LWP packages, or should I be making raw TCP connections and simply closing the socket when I have enough data?

Your advice would be greatly appreciated!!!!


Re: How to download JUST the first X bytes of HTTP response ?
by BrowserUk (Patriarch) on Dec 07, 2010 at 17:34 UTC

    This worked (fetched just the first 1000 bytes) for every server I tried from a random selection thrown up by google image search:

     #! perl -slw
     use strict;
     use LWP::UserAgent;

     my $ua = LWP::UserAgent->new;

     my $resp = $ua->get(
         "http://www.somesite.com/product_images/theImage.jpg",
         Range => 'bytes=0-999'
     ) or die;

     print unpack 'H*', $resp->content;

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      In case anyone's interested, I had a go with some code using this method and it worked with only the specified bytes coming down the line.

      However, with specific reference to OP's requirement, I've found that the EXIF portion of a JPEG file isn't a fixed length and for very large JPEGs can be larger than you expect... A few tests I ran (as I'm finding this thread particularly interesting...) showed that sometimes over 10K of download was needed before ExifTool would recognise the EXIF data...
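
      One way to avoid guessing at a fixed cut-off is to peek at the JPEG segment markers first: the EXIF block lives in an APP1 segment whose two-byte length field tells you exactly how many bytes cover it. A rough sketch of that idea (the URL is a placeholder, and it assumes the APP1 marker shows up within the first 64 bytes, which is typical for camera JPEGs but not guaranteed):

      use strict;
      use warnings;
      use LWP::UserAgent;
      use Image::ExifTool qw(ImageInfo);

      my $url = 'http://www.example.com/image.jpg';    # placeholder
      my $ua  = LWP::UserAgent->new;

      # grab a small chunk first: the SOI marker plus the first segment headers
      # (a server that ignores Range will simply send the whole file here)
      my $head = $ua->get($url, Range => 'bytes=0-63')->content;
      die "not a JPEG\n" unless substr($head, 0, 2) eq "\xFF\xD8";

      # walk the segment markers looking for APP1 (0xFFE1), which holds the EXIF data
      my ($pos, $need) = (2, undef);
      while ($pos + 4 <= length $head) {
          my ($ff, $marker) = unpack 'C2', substr($head, $pos, 2);
          last unless $ff == 0xFF;
          my $len = unpack 'n', substr($head, $pos + 2, 2);   # includes the 2 length bytes
          if ($marker == 0xE1) {
              $need = $pos + 2 + $len;    # total bytes needed to cover the EXIF segment
              last;
          }
          $pos += 2 + $len;               # skip this segment (e.g. a JFIF APP0)
      }
      die "no APP1 segment in the first 64 bytes\n" unless $need;

      # second request: exactly enough bytes to cover the EXIF segment
      my $part = $ua->get($url, Range => 'bytes=0-' . ($need - 1))->content;
      my $info = ImageInfo(\$part);
      print "$_ => $$info{$_}\n" for sort keys %$info;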

      fx, Infinity is Colourless

        Indeed. I just inspected a .jpg at random and it contained the EXIF data shown below. The first field (OLYMPUS DIGITAL CAMERA) starts at offset 0xbc; and the last (JpegIFByteCount - 8092) ends at 0x2a19. On the scant basis of those two images (yours and mine), 10K seems like a good starting point. For this image, that is still a substantial saving over the 677KB for the full image.

        ImageDescription - OLYMPUS DIGITAL CAMERA
        Make - OLYMPUS CORPORATION
        Model - E-1
        Orientation - Top left
        XResolution - 314.00
        YResolution - 314.00
        ResolutionUnit - Inch
        Software - Adobe Photoshop CS4 Windows
        DateTime - 2010:08:22 16:33:44
        YCbCrPositioning - Co-Sited
        ExifOffset - 540
        ExposureTime - 1/800 seconds
        FNumber - 6.30
        ExposureProgram - Aperture priority
        ISOSpeedRatings - 200
        ExifVersion - 0221
        DateTimeOriginal - 2010:08:22 11:15:59
        DateTimeDigitized - 2010:08:22 11:15:59
        ComponentsConfiguration - YCbCr
        ExposureBiasValue - 0.00
        MaxApertureValue - F 3.50
        MeteringMode - Spot
        LightSource - Auto
        Flash - Not fired
        FocalLength - 14 mm
        UserComment -
        FlashPixVersion - 0100
        ColorSpace - sRGB
        ExifImageWidth - 2560
        ExifImageHeight - 1920
        InteroperabilityOffset - 1120
        FileSource - DSC - Digital still camera
        CustomRendered - Normal process
        ExposureMode - Auto
        White Balance - Auto
        DigitalZoomRatio - 0.00 x
        SceneCaptureType - Standard
        GainControl - Low gain up
        Contrast - Normal
        Saturation - Normal
        Sharpness - Normal
        Thumbnail: -
        Compression - 6 (JPG)
        XResolution - 72
        YResolution - 72
        ResolutionUnit - Inch
        JpegIFOffset - 1246
        JpegIFByteCount - 8092

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: How to download JUST the first X bytes of HTTP response ?
by CountZero (Bishop) on Dec 07, 2010 at 17:20 UTC
    I found this in the docs of LWP::UserAgent:
    $ua->max_size( $bytes )
    Get/set the size limit for response content. The default is undef, which means that there is no limit. If the returned response content is only partial, because the size limit was exceeded, then a "Client-Aborted" header will be added to the response. The content might end up longer than max_size as we abort once appending a chunk of data makes the length exceed the limit. The "Content-Length" header, if present, will indicate the length of the full content and will normally not be the same as length($res->content).
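
    A quick sketch of that approach (the URL and the 1K limit are placeholders). Note that, unlike a Range request, max_size only stops the client from collecting more data; the server may already have pushed more than the limit down the connection before LWP aborts:

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $url = 'http://www.example.com/image.jpg';   # placeholder

    my $ua = LWP::UserAgent->new;
    $ua->max_size(1024);                            # stop collecting after roughly 1K

    my $resp = $ua->get($url);
    if ($resp->header('Client-Aborted')) {          # response was truncated at max_size
        printf "kept %d of %s bytes\n",
            length($resp->content),
            $resp->header('Content-Length') || 'unknown';
    }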

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: How to download JUST the first X bytes of HTTP response ?
by JavaFan (Canon) on Dec 07, 2010 at 17:16 UTC
    LWP gives you access to HTTP::Request, which gives you access to HTTP::Headers, which you can use to set the HTTP 'Range' header.

    Whether the server is actually going to be able to deal with that header is a different question.
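
    For reference, a minimal sketch of that approach (the URL is a placeholder); the same Range header can also be passed directly to $ua->get(), as in BrowserUk's reply above:

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTTP::Request;

    my $url = 'http://www.example.com/image.jpg';   # placeholder

    my $req = HTTP::Request->new(GET => $url);
    $req->header(Range => 'bytes=0-1023');          # ask for just the first 1K

    my $ua   = LWP::UserAgent->new;
    my $resp = $ua->request($req);

    if ($resp->code == 206) {          # 206 Partial Content: the server honoured the range
        printf "got %d bytes\n", length $resp->content;
    } elsif ($resp->is_success) {      # plain 200: the server ignored Range and sent it all
        warn "server ignored the Range header\n";
    } else {
        die $resp->status_line, "\n";
    }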

Re: How to download JUST the first X bytes of HTTP response ?
by boardhead (Novice) on Dec 08, 2010 at 14:15 UTC

    I do this exact task myself with an image metadata crawler I wrote. The steps are:

    1) Connect to the network socket (S)

    2) Turn buffering in the socket off with select(S); $| = 1; select(STDOUT);

    3) Send request to server:

    $result = eval 'print S "GET /$document HTTP/1.1\nHost: $server_host\n\n"';

    4) Read <S> up to the end of the HTTP header (and verify that you have the expected Content-type).

    5) Call Image::ExifTool::ImageInfo(\*S, {FastScan => 1}) to read the metadata.

    ExifTool will only read as much of the file as necessary to obtain the metadata (the FastScan option prevents reading to the end of the image to look for a metadata trailer).
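
    For anyone who wants to try those steps without the low-level socket() calls, roughly the same thing can be written with IO::Socket::INET (the host and path below are placeholders, and step 4 is simplified to just skipping past the headers; this is a sketch of the idea, not Phil's actual crawler):

    use strict;
    use warnings;
    use IO::Socket::INET;
    use Image::ExifTool qw(ImageInfo);

    my $host = 'www.example.com';          # placeholder
    my $path = '/images/picture.jpg';      # placeholder

    # steps 1 and 2: connect and turn off output buffering
    my $sock = IO::Socket::INET->new(
        PeerHost => $host,
        PeerPort => 80,
        Proto    => 'tcp',
    ) or die "connect: $!";
    $sock->autoflush(1);

    # step 3: send the request (HTTP/1.0 keeps the reply a plain byte stream,
    # with no chunked encoding to worry about)
    print $sock "GET $path HTTP/1.0\r\nHost: $host\r\n\r\n";

    # step 4: read and discard the response headers
    while (my $line = <$sock>) {
        last if $line =~ /^\r?\n$/;
    }

    # step 5: let ExifTool read only what it needs from the open socket
    my $info = ImageInfo($sock, { FastScan => 1 });
    print "$_ => $$info{$_}\n" for sort keys %$info;
    close $sock;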

    - Phil

      I'm not 100% convinced by this... but happy to be proved wrong! :)

      I had a little go with some code based on these ideas and found that talking "raw" to the remote web server meant I ended up downloading the entire file regardless of how much of it I actually read (confirmed using Wireshark to track the actual inbound data). This, in my mind, wouldn't save the OP any bandwidth, as the whole file is coming down the line anyway.

      As I said, happy to be proved wrong!

      fx, Infinity is Colourless

        You're determined to make me do some real work, aren't you? :P

        As a test, I hacked my crawler to download a single 20MB JPEG image from a remote http server and used tcpdump to view the network traffic. I tested this with and without the ExifTool FastScan option, and repeated the test a few times to make sure I wasn't being fooled by other network traffic. I consistently saw about 50 packets transferred with FastScan set, and about 37000 packets without FastScan.

        So I conclude that this definitely works for me. I can't say what the difference is for you. I ran my tests on Mac OS X 10.4.11 with Perl 5.8.6.

        - Phil
        I prepared a test script which works for me, and transfers only as much of the image as necessary to extract the metadata:
        #! /usr/bin/perl -w
        #
        # File:         url_info
        #
        # Author:       Phil Harvey
        #
        # Syntax:       url_info URL
        #
        # Description:  test to get image info from image on the net
        #
        # Example:      url_info http://owl.phy.queensu.ca/~phil/big.jpg
        #
        # References:   Based on web crawler script:
        #               http://www.linuxjournal.com/files/linuxjournal.com/linuxjournal/articles/022/2200/2200l1.html
        #
        use strict;
        use Image::ExifTool;

        sub url_info($);

        my $DEBUG = 0;      # set to 1 for debugging network stuff

        my $url = shift or die "Syntax: url_info URL\n";

        my $exifTool = new Image::ExifTool;
        $exifTool->Options(FastScan => 1);

        my $info = url_info($url);
        die "No image info for $url\n" unless $info;

        foreach (sort keys %$info) {
            print "$_ => $$info{$_}\n";
        }
        exit 1;

        #---------------------------------------------------------------------
        # Get the page at specified URL
        # Inputs:  0) URL
        # Returns: 0) reference to extracted image information hash, or undef on error
        #---------------------------------------------------------------------
        sub url_info($)
        {
            my $url = shift;
            my ($protocol, $host, $port, $document) =
                $url =~ m|^([^:/]*)://([^:/]*):*([0-9]*)/*([^:]*)$|;

            # Some constants used to access the TCP network.
            my $AF_INET = 2;
            my $SOCK_STREAM = 1;

            # Use default http port if none specified.
            $port = 80 unless $port;

            # Get the protocol number for TCP.
            my ($name, $aliases, $proto) = getprotobyname("tcp");

            # Get the IP address for the server host.
            my ($type, $len, $thataddr);
            ($name, $aliases, $type, $len, $thataddr) = gethostbyname($host);

            # Check we could resolve the server host name.
            return undef unless defined $thataddr;
            my ($a, $b, $c, $d) = unpack('C4', $thataddr);
            if (not defined $d or ($a eq '' && $b eq '' && $c eq '' && $d eq '')) {
                warn "Unknown host $host.\n";
                return undef;
            }
            print "Server: $host ($a.$b.$c.$d)\n" if $DEBUG;

            # Pack the AF_INET magic number, the port, and the (already
            # packed) IP addresses into the same format as the C structure
            # would use.  Note this is architecture dependent: this pack format
            # works for 32 bit architectures.
            my $that = pack("S n a4 x8", $AF_INET, $port, $thataddr);

            # Create the socket and connect.
            unless (socket(S, $AF_INET, $SOCK_STREAM, $proto)) {
                warn "Cannot create socket.\n";
                alarm 0;
                return undef;
            }
            print "Socket OK\n" if $DEBUG;

            local $SIG{ALRM} = sub { die "ALARM\n" };
            alarm 3600;     # set timeout of 1 hour

            my $result = eval 'connect(S, $that)';
            if ($@ or not $result) {
                warn "Cannot connect to server $host, port $port.\n";
                alarm 0;
                return undef;
            }
            print "Connect OK\n" if $DEBUG;

            # Turn buffering in the socket off, and send request to the server.
            select(S); $| = 1; select(STDOUT);
            $result = eval 'print S "GET /$document HTTP/1.1\nHost: $host\n\n"';
            if ($@ or not $result) {
                warn "Timeout when sending to $host, port $port.\n";
                alarm 0;
                return undef;
            }

            # Receive the response.  Check to ensure the response is of MIME
            # type image/* before extracting the metadata.
            my $header = 1;
            my $header_text = "";
            for (;;) {
                $_ = eval '<S>';
                if ($@) {
                    warn "Timeout when reading from $host, port $port\n";
                    last;
                }
                last unless defined $_;
                # Check if we've hit the end of the HTTP header (empty line).
                # If we have, check for a Content-type header line, and ensure
                # it is valid.
                if (m|^[\n\r]*$|) {
                    $header = 0;
                    my ($content) = $header_text =~ /Content-type: ([^\s;]+)/i;
                    if ($content and $content =~ m{^image/(.*)}) {
                        # extract image metadata
                        my $raf = new File::RandomAccess(\*S);
                        $info = $exifTool->ImageInfo($raf);
                    } else {
                        warn "Not an image\n";
                    }
                    last;   # all done
                } elsif ($header == 1) {
                    # Save to a header string if we're still working on the
                    # HTTP header.
                    $header_text .= " " . $_;
                }
            }
            eval 'close S';
            alarm 0;
            print "HTTP header: \n $header_text" if $DEBUG;
            return $info;
        }
        - Phil
Re: How to download JUST the first X bytes of HTTP response ?
by Illuminatus (Curate) on Dec 07, 2010 at 17:14 UTC
    I'm not sure what you mean by 'making raw TCP connections'. Are you able to create pages and CGI scripts on the server? If so, you should be able to use LWP on the client side:
    1. Write a CGI script for the server that takes the path and name of a .jpg file and sends back a response page containing the metadata you want (a rough sketch follows this list)
    2. Create a page on the server to invoke the CGI script
    3. Write a client Perl program using LWP that expects the page in the format you defined in (1)
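
    A rough sketch of the CGI script from step 1, assuming the images live somewhere like /var/www/images and that a plain-text response is good enough (both assumptions are mine, and the path check is deliberately minimal):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use CGI;
    use Image::ExifTool qw(ImageInfo);

    my $q    = CGI->new;
    my $file = $q->param('file') || '';
    die "bad path\n" if $file =~ /\.\./;    # naive check; a real script needs proper validation

    print $q->header('text/plain');

    # only pull the tags the client cares about (hypothetical list)
    my $info = ImageInfo("/var/www/images/$file", 'Make', 'Model', 'DateTimeOriginal');
    print "$_: $$info{$_}\n" for sort keys %$info;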

    fnord