in reply to Re: How to get web creation date from webserver?
in thread How to get web creation date from webserver?

Hi all,

Using httrack, I have downloaded the page from a given URL and am working with it offline. I have to check the URL daily to see whether the page has been modified or not. If the page has been modified I have to download it again, else I have to exit. So, for this purpose I want to get the date the page was last modified. Is that possible? Please help me.


Re^3: How to get web creation date from webserver?
by jhourcle (Prior) on Aug 23, 2005 at 10:25 UTC

    Read the HTTP specification. Specifically, section 14.25, 'If-Modified-Since'.

    You send back the 'Last-Modified' timestamp from when you cached the file (or the date you fetched it, though then you have to generate the date format yourself) in an 'If-Modified-Since' request header. If the file hasn't been modified, and the webserver supports this header, it should return a '304' status rather than the full content all over again.
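    For example, a minimal conditional-GET sketch with LWP (untested; the URL and cache file name are placeholders):

        use LWP::UserAgent;
        use HTTP::Date qw(time2str);

        my $url  = 'http://example.com/page.html';   # placeholder URL
        my $file = 'page.html';                      # local cached copy

        my $ua = LWP::UserAgent->new;

        # send If-Modified-Since based on the cached file's mtime
        my @conditional;
        push @conditional, 'If-Modified-Since' => time2str( (stat $file)[9] )
            if -e $file;

        my $response = $ua->get( $url, @conditional );

        if ( $response->code == 304 ) {
            print "Not modified -- keeping the cached copy\n";
        }
        elsif ( $response->is_success ) {
            open my $fh, '>', $file or die "Can't write $file: $!";
            print {$fh} $response->content;
            close $fh;
            print "Changed -- saved a fresh copy\n";
        }
        else {
            warn "Fetch failed: ", $response->status_line, "\n";
        }

    Note that LWP::UserAgent's mirror() method does essentially the same conditional GET and saves the file for you, so for plain mirroring that one call may be all you need.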

Re^3: How to get web creation date from webserver?
by holli (Abbot) on Aug 23, 2005 at 07:08 UTC
    So you don't want to know when the page was created; you want to know whether the page has changed since you last visited/downloaded it. I don't know of a ready-made Perl way to do this, but there are a lot of programs for it, e.g. webmon.


    holli, /regexed monk/
      You could fetch the page with LWP, calculate & store an MD5 checksum, then simply compare the current checksum with the last one.

      Code hastily snipped and sanitised :)

      use LWP::UserAgent;
      use Digest::MD5 qw/md5_hex/;

      sub web_MD5 {
          # get the MD5 sum of a URL's content
          my $url = shift;
          my $ua  = LWP::UserAgent->new(
              env_proxy  => 1,
              keep_alive => 1,
              timeout    => 30,
          );
          my $response = $ua->get($url);
          # warn on failure (but note: execution falls through regardless)
          warn "Error while getting ", $response->request->uri,
               " -- ", $response->status_line, "\nAborting"
              unless $response->is_success;
          my $doc = $response->content();
          my $md5 = md5_hex($doc);
          undef $ua;
          return $md5;
      }
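      For the "store and compare" part, a minimal sketch (the checksum file name and URL are placeholders):

          my $checksum_file = 'page.md5';   # hypothetical file holding the last checksum
          my $last = '';
          if ( -e $checksum_file ) {
              open my $in, '<', $checksum_file or die "Can't read $checksum_file: $!";
              $last = <$in>;
              close $in;
          }
          my $now = web_MD5('http://example.com/page.html');
          if ( $now ne $last ) {
              print "Page has changed since the last check\n";
              open my $out, '>', $checksum_file or die "Can't write $checksum_file: $!";
              print {$out} $now;
              close $out;
          }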
        A few suggestions:

        Once you have fetched the page, you could simply and blindly write it over the old version on disk. That would be cheaper than calculating a checksum, especially because as it stands you have to download the page twice (your function does not return the fetched data). So I would at least alter your code to return the data as well:
        return ($md5, \$response);
        Also, you do not return (but only warn) when the fetch fails. I didn't check what happens when you md5_hex an undef value or what the code returns then, but it should return undef to indicate the failure to the caller. So, even if your code works, it would simply be clearer if you explicitly returned when the fetch fails:
        unless ($response->is_success) {
            warn "Error while getting ", $response->request->uri,
                 " -- ", $response->status_line, "\nAborting";
            return;
        }


        holli, /regexed monk/

      Dear holli,

      Is there any such tool for Linux? Also, I want the check to run automatically, and based on the result run a Perl file to do the download. With webmon I would have to check daily by hand; is there any other way to get the information about changes made to the page automatically? Thanks.

        The command-line tool wget supports mirroring of sites (or pages) and has options for fetching only files that are newer than your local copies (see its man page). As for the "check daily" part, that sounds like a job for crontab.
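        For example (untested; the URL, path, and schedule are placeholders):

            # fetch the page only if the server copy is newer than the local one
            wget -N http://example.com/page.html

            # crontab entry: run the check quietly every day at 06:00
            0 6 * * * wget -N -q -P /home/user/pages http://example.com/page.html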