in reply to Downloading image files using LWP

Are you sure it uses a local copy? I'd think there is a proxy between you and the server that does this.

Anyway, if you need to be really sure, try appending a timestamp to the URL:

my $request = HTTP::Request->new(GET => $url . '?' . time());

The parameter will most likely be ignored by the webserver, but the proxies will not dare to interfere.

Of course, if the URL already contains a query, append a new parameter instead:

my $request = HTTP::Request->new(GET => $url . '&tImEstaMp=' . time());
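The two cases above can be folded into one helper. This is only a sketch, assuming the URI module (which LWP already depends on) is installed. The helper name cache_busted and the example.com URLs are made up for illustration; the parameter name is arbitrary.

```perl
use strict;
use warnings;
use URI;

# Append a throwaway timestamp parameter, whether or not the URL
# already has a query string. cache_busted() is a hypothetical helper.
sub cache_busted {
    my ($url, $stamp) = @_;
    $stamp = time() unless defined $stamp;
    my $uri = URI->new($url);
    # query_form() with no arguments returns the existing key/value pairs
    # as a flat list; passing a list sets the query, so any parameters
    # already present are kept and the timestamp is added at the end.
    $uri->query_form($uri->query_form, tImEstaMp => $stamp);
    return $uri->as_string;
}

print cache_busted('http://www.example.com/images/logo.gif'), "\n";
print cache_busted('http://www.example.com/show?id=7'), "\n";
```

Letting URI pick between '?' and '&' avoids having to inspect the URL by hand.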

Jenda

Re: Re: Downloading image files using LWP
by gnangia (Scribe) on Nov 05, 2002 at 22:28 UTC
    Ok here is my code,
    while (my $url = shift @urls) {
        print "URL is $url\n";
        my $request = HTTP::Request->new(GET => $url);
        my $parser  = HTML::Parser->new(api_version => 3);
        $parser->handler(start => \&start, 'self,tagname,attr');
        my $response = $browser->request($request);
        if ($response->is_success) {
            print $response->content();
            $parser->{base}    ||= $response->base;
            $parser->{browser} ||= $browser;
            $parser->parse($response->content);
            $parser->eof();
        }
        else {
            print "ERROR: " . $response->status_line . "\n";
        }
    }

    sub start {
        my ($parser, $tagname, $attr) = @_;
        if ($tagname eq 'img') {
            if ($attr->{src}) {
                my $img_url     = $attr->{src};
                my $remote_name = URI->new_abs($img_url, $parser->{base});
                #my ($local_name) = $img_url =~ m!([^/]+)$!;
                my $local_name = $remote_name->host . $remote_name->path;
                #my $local_name = "/dev/null";
                mkpath(dirname($local_name), 0, 0711);
                print "Getting imagefile: $img_url\n";
                my $response = $parser->{browser}->mirror($remote_name, $local_name);
                print STDERR "YYY-$local_name: ", $response->message, "\n";
            }
        }
    }
    Here is the output when I run it the second time:
    Getting imagefile: images/logo.gif
    LWP::UserAgent::mirror: ()
    LWP::UserAgent::request: ()
    HTTP::Cookies::add_cookie_header: Checking www.google.com for cookies
    HTTP::Cookies::add_cookie_header: Checking .google.com for cookies
    HTTP::Cookies::add_cookie_header: - checking cookie path=/
    HTTP::Cookies::add_cookie_header: - checking cookie PREF=ID=0f9d8bbb3b0ee898:TM=1036535059:LM=1036535059:S=2ea2eKPQlO4uYAN6
    HTTP::Cookies::add_cookie_header: it's a match
    HTTP::Cookies::add_cookie_header: Checking google.com for cookies
    HTTP::Cookies::add_cookie_header: Checking .com for cookies
    LWP::UserAgent::send_request: GET http://www.google.com/images/logo.gif
    LWP::UserAgent::_need_proxy: Not proxied
    LWP::Protocol::http::request: ()
    LWP::UserAgent::request: Simple response: Not Modified
    YYY-www.google.com/images/logo.gif: 304 Not Modified
      You are using LWP::UserAgent::mirror(), which does the local caching. It checks for the local file, puts the file's timestamp in an If-Modified-Since header, and does a conditional GET.
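      To make the conditional GET concrete, here is a rough sketch of the kind of request mirror() builds. It is not LWP's actual source; the helper name conditional_get and the example.com URL are assumptions for illustration, and HTTP::Date ships alongside LWP.

```perl
use strict;
use warnings;
use HTTP::Request;
use HTTP::Date qw(time2str);

# Build a conditional GET: if the local file exists, send its
# modification time in If-Modified-Since. conditional_get() is a
# hypothetical helper, not part of LWP.
sub conditional_get {
    my ($url, $file) = @_;
    my $request = HTTP::Request->new(GET => $url);
    if (-e $file) {
        my $mtime = (stat $file)[9];    # element 9 of stat() is the mtime
        $request->header('If-Modified-Since' => time2str($mtime));
    }
    return $request;
}

my $request = conditional_get('http://www.example.com/images/logo.gif',
                              'www.example.com/images/logo.gif');
print $request->as_string;
```

      If the server's copy is not newer than that date, it answers 304 Not Modified and sends no body, which matches the trace above.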

      Since you want to force the file to be downloaded, either don't use mirror, or delete the local file before you call it.

      The UserAgent request method takes a filename as the second parameter. It will create (or overwrite) the file with the downloaded contents. You should check that the download succeeded and returned the expected number of bytes.
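      One way to do that check is sketched below. It assumes the server sends a Content-Length header, which not every server does; download_ok is a hypothetical helper name, not an LWP function.

```perl
use strict;
use warnings;
use HTTP::Response;

# Decide whether a response that was saved to $file looks complete.
# Returns 1 if it does, 0 otherwise.
sub download_ok {
    my ($response, $file) = @_;
    return 0 unless $response->is_success;   # e.g. 304 or 404 fails here
    return 0 unless -e $file;
    my $expected = $response->header('Content-Length');
    # Without Content-Length only the status can be checked.
    return 1 unless defined $expected;
    my $size = (stat $file)[7];              # element 7 of stat() is the size
    return $size == $expected ? 1 : 0;
}

# Intended use (needs network access, so left as a comment):
#   my $response = $browser->request($request, $local_name);
#   warn "incomplete download: $local_name\n"
#       unless download_ok($response, $local_name);
```

      When the body is written to a file, the response object keeps the headers, so the Content-Length comparison still works.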

        That was it. I ran a sniffer trace and found that the request included an "If-Modified-Since" header, which caused google.com to reply with HTTP status code 304 (Not Modified). So I modified the body of my last subroutine, where I request the image URL, as follows:
        my $img_url     = $attr->{src};
        my $remote_name = URI->new_abs($img_url, $parser->{base});
        my $local_name  = $remote_name->host . $remote_name->path;
        mkpath(dirname($local_name), 0, 0711);
        print "Getting imagefile: $img_url\n";
        # Plain GET saved straight to the file; no If-Modified-Since is sent.
        my $request  = HTTP::Request->new(GET => $remote_name);
        my $response = $browser->request($request, $local_name);
        print STDERR "YYY-$local_name: ", $response->status_line, "\n";
        It works now. Thanks to everyone for their input.

      Are you sure that the remote image is getting changed between iterations? If not, then the program is doing what it should. If you want the image regardless of whether or not it has been modified, then delete or rename the local copy.

      To test, create a logo.gif file and copy it in place of the cached version before making the new request. It should notice a later last modified time and get the newer file.

      ~Hammy

      At a guess, I'd say lose that cookie between runs.  It looks like Google's being smart about whether you already got it :).

        p