in reply to Re: Downloading image files using LWP
in thread Downloading image files using LWP

OK, here is my code:
while (my $url = shift @urls) {
    print "URL is $url\n";
    my $request = HTTP::Request->new(GET => $url);
    my $parser  = HTML::Parser->new(api_version => 3);
    $parser->handler(start => \&start, 'self,tagname,attr');
    my $response = $browser->request($request);
    if ($response->is_success) {
        print $response->content();
        $parser->{base}    ||= $response->base;
        $parser->{browser} ||= $browser;
        $parser->parse($response->content);
        $parser->eof();
    }
    else {
        print "ERROR: " . $response->status_line . "\n";
    }
}

sub start {
    my ($parser, $tagname, $attr) = @_;
    if ($tagname eq 'img') {
        if ($attr->{src}) {
            my $img_url     = $attr->{src};
            my $remote_name = URI->new_abs($img_url, $parser->{base});
            #my ($local_name) = $img_url =~ m!([^/]+)$!;
            my $local_name = $remote_name->host . $remote_name->path;
            #my $local_name = "/dev/null";
            mkpath(dirname($local_name), 0, 0711);
            print "Getting imagefile: $img_url\n";
            my $response = $parser->{browser}->mirror($remote_name, $local_name);
            print STDERR "YYY-$local_name: ", $response->message, "\n";
        }
    }
}
Here is the output when I run it the second time:
Getting imagefile: images/logo.gif
LWP::UserAgent::mirror: ()
LWP::UserAgent::request: ()
HTTP::Cookies::add_cookie_header: Checking www.google.com for cookies
HTTP::Cookies::add_cookie_header: Checking .google.com for cookies
HTTP::Cookies::add_cookie_header: - checking cookie path=/
HTTP::Cookies::add_cookie_header: - checking cookie PREF=ID=0f9d8bbb3b0ee898:TM=1036535059:LM=1036535059:S=2ea2eKPQlO4uYAN6
HTTP::Cookies::add_cookie_header: it's a match
HTTP::Cookies::add_cookie_header: Checking google.com for cookies
HTTP::Cookies::add_cookie_header: Checking .com for cookies
LWP::UserAgent::send_request: GET http://www.google.com/images/logo.gif
LWP::UserAgent::_need_proxy: Not proxied
LWP::Protocol::http::request: ()
LWP::UserAgent::request: Simple response: Not Modified
YYY-www.google.com/images/logo.gif: 304 Not Modified

Re: Re: Re: Downloading image files using LWP
by iburrell (Chaplain) on Nov 06, 2002 at 02:01 UTC
    You are using LWP::UserAgent::mirror(), which does local caching. It checks for the local file, uses its timestamp in an If-Modified-Since header, and does a conditional GET.
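
    As a rough illustration of that conditional GET (a sketch of the idea, not the library's actual source), the request can be built by hand. The sub name conditional_get and the example URL are made up for this post; HTTP::Date's time2str formats the file's modification time the way HTTP expects.

```perl
use strict;
use warnings;
use HTTP::Request;
use HTTP::Date qw(time2str);

# Hypothetical helper: build a GET for $url and, if $file already exists,
# add an If-Modified-Since header based on the file's modification time.
sub conditional_get {
    my ($url, $file) = @_;
    my $request = HTTP::Request->new(GET => $url);
    if (-e $file) {
        my $mtime = (stat $file)[9];    # element 9 of stat is the mtime
        $request->header('If-Modified-Since' => time2str($mtime));
    }
    return $request;
}

# Example (assumed URL and path):
# print conditional_get('http://www.example.com/logo.gif',
#                       'www.example.com/logo.gif')->as_string;
```

    A server that considers the resource unchanged since that time answers 304 Not Modified with no body, which matches the trace above.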

    Since you want to force the file to be downloaded, either don't use mirror, or delete the local file before you call it.
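
    A minimal sketch of the second option, assuming you want to keep mirror() but always refetch. The name force_mirror is invented here; it simply removes the local copy first so there is no timestamp to send.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: delete any existing local copy, then mirror.
# With no local file, mirror() has nothing to base If-Modified-Since on,
# so the server is asked for the full resource.
sub force_mirror {
    my ($ua, $url, $file) = @_;
    if (-e $file) {
        unlink $file or die "Cannot remove $file: $!";
    }
    return $ua->mirror($url, $file);
}

# Example (assumed names):
# my $response = force_mirror($browser, $remote_name, $local_name);
```

    Note that this throws away the bandwidth saving that mirror() exists for, so it only makes sense when a fresh copy is really required every time.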

    The UserAgent request method takes a filename as its second parameter. It will create (or overwrite) that file with the downloaded contents. You should check that the download succeeded and returned the expected number of bytes.
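
    Something along these lines might do for the fetch and the size check. The sub names fetch_to_file and download_complete are hypothetical, and the check assumes the server sent a Content-Length header; when it did not, the sketch falls back to trusting the status code alone.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# Hypothetical helper: unconditional GET, body written straight to $file.
sub fetch_to_file {
    my ($ua, $url, $file) = @_;
    return $ua->request(HTTP::Request->new(GET => $url), $file);
}

# Hypothetical helper: true if the response succeeded and the file on disk
# has as many bytes as the Content-Length header announced.
sub download_complete {
    my ($response, $file) = @_;
    return 0 unless $response->is_success;
    my $expected = $response->content_length;
    return 1 unless defined $expected;    # nothing to compare against
    my $size = -s $file;
    $size = 0 unless defined $size;       # missing or empty file
    return $size == $expected ? 1 : 0;
}

# Example (assumed names):
# my $response = fetch_to_file($browser, $remote_name, $local_name);
# warn "Incomplete: $local_name\n" unless download_complete($response, $local_name);
```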

      That was it. I ran a sniffer trace and found that the request was sending an "If-Modified-Since" header, which caused google.com to reply with HTTP status code 304 (Not Modified). So I modified the part of my last subroutine where I request the image URL as follows:
      my $img_url     = $attr->{src};
      my $remote_name = URI->new_abs($img_url, $parser->{base});
      my $local_name  = $remote_name->host . $remote_name->path;
      mkpath(dirname($local_name), 0, 0711);
      print "Getting imagefile: $img_url\n";
      # Plain GET with a filename as second argument: no If-Modified-Since is sent
      my $request  = HTTP::Request->new(GET => $remote_name);
      my $response = $browser->request($request, $local_name);
      print STDERR "YYY-$local_name: ", $response->status_line, "\n";
      It works now. Thanks to everyone for their input.
Re: Re: Re: Downloading image files using LWP
by HamNRye (Monk) on Nov 06, 2002 at 00:07 UTC

    Are you sure that the remote image is getting changed between iterations? If not, then the program is doing what it should. If you want the image regardless of whether or not it has been modified, then delete or rename the local copy.

    To test, create a logo.gif file and copy it in place of the cached version before making the new request. It should notice a later last modified time and get the newer file.

    ~Hammy

Re: Re: Re: Downloading image files using LWP
by petral (Curate) on Nov 06, 2002 at 00:04 UTC
    At a guess, I'd say lose that cookie between runs.  It looks like Google's being smart about whether you already got it :).

      p