Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm retrieving webpages using WWW::Mechanize. I need to retrieve a page only if its content has changed. How can I do this?

Replies are listed 'Best First'.
Re: www::mechanize check if webpage has updated
by JavaFan (Canon) on Sep 16, 2010 at 12:00 UTC
    Store the time of your latest retrieval. Issue a HEAD request with the 'If-Modified-Since' header set to that time. Don't do a retrieval if the response is a '304 Not Modified'.

    Of course, a webserver may not honor the 'If-Modified-Since' header, or send a 200 anyway. There's no guaranteed way of finding out whether a page has changed without retrieving it.
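A minimal sketch of the conditional request described above, assuming the time of the last retrieval is kept as an epoch value. The helper names `conditional_head` and `not_modified` are made up for illustration; `HTTP::Request`, `HTTP::Date::time2str` and `WWW::Mechanize->request` come with the libwww-perl and WWW::Mechanize distributions.

```perl
use strict;
use warnings;
use HTTP::Date qw(time2str);
use HTTP::Request;

# Build a HEAD request carrying If-Modified-Since.
# $last_fetch is the epoch time of the previous retrieval
# (where it is stored is left to the caller).
sub conditional_head {
    my ($url, $last_fetch) = @_;
    return HTTP::Request->new(
        HEAD => $url,
        [ 'If-Modified-Since' => time2str($last_fetch) ],
    );
}

# True only when the server explicitly answered "304 Not Modified".
sub not_modified {
    my ($response) = @_;
    return $response->code == 304;
}

# Usage with WWW::Mechanize (needs network access, so left commented out):
# use WWW::Mechanize;
# my $mech = WWW::Mechanize->new( autocheck => 0 );  # so a 304 is not fatal
# my $res  = $mech->request( conditional_head($url, $last_fetch) );
# $mech->get($url) unless not_modified($res);
```

Anything other than a 304 is treated as "possibly changed" here, which matches the caveat that a server may ignore the header and send a 200 anyway.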

      The problem is that the page I want to verify is obtained through submit_form. So it's fine even if I can only verify after the HTTP::Response object is obtained, for example by comparing it with a saved version.

        Instead of a saved version, just store the hash of the saved version and do nothing if the hash is the same.
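One way the hash comparison could look, as a sketch only: the helper `content_changed` and the state file name are hypothetical, and SHA-256 is just one reasonable choice of digest. `Digest::SHA` and `Encode` ship with Perl.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);
use Encode qw(encode_utf8);

# Returns 1 if $content differs from what was seen on the previous call
# (or if there was no previous call) and records the new digest in
# $state_file. Returns 0 if the digest is unchanged.
sub content_changed {
    my ($content, $state_file) = @_;

    # $mech->content may hold decoded characters; digests need bytes.
    my $digest = sha256_hex( encode_utf8($content) );

    my $previous = '';
    if ( open my $in, '<', $state_file ) {
        $previous = <$in> // '';
        chomp $previous;
        close $in;
    }
    return 0 if $digest eq $previous;

    open my $out, '>', $state_file or die "Cannot write $state_file: $!";
    print {$out} "$digest\n";
    close $out or die "Cannot close $state_file: $!";
    return 1;
}

# Usage after a form submission (form number and field are placeholders):
# $mech->submit_form( form_number => 1, fields => { q => 'perl' } );
# handle_page( $mech->content )
#     if content_changed( $mech->content, 'page.sha256' );
```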

        Instead of hashing the whole page, just hash the portion(s) of the content that you care about. That way different advertisements or updates to the nav links in the header/footer won't throw you off (unless you care about those sorts of changes too).
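Hashing only one portion might be sketched like this, assuming the part you care about sits in an element with a known `id` attribute. The helper `section_digest` and the `id` values are hypothetical; `HTML::TreeBuilder` is a CPAN module (from the HTML-Tree distribution), not part of core Perl.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);
use Encode qw(encode_utf8);
use HTML::TreeBuilder;

# Digest only the text inside the element with the given id attribute.
# Returns undef if no such element exists.
sub section_digest {
    my ($html, $id) = @_;
    my $tree    = HTML::TreeBuilder->new_from_content($html);
    my $section = $tree->look_down( id => $id );   # first match or undef
    my $digest  = defined $section
                ? sha256_hex( encode_utf8( $section->as_text ) )
                : undef;
    $tree->delete;    # free the parse tree
    return $digest;
}
```

Note that `as_text` ignores markup, so this only notices changes in the visible text of that element. If added or removed tags inside the section matter too, `as_HTML` could be digested instead.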

Re: www::mechanize check if webpage has updated
by marto (Cardinal) on Sep 16, 2010 at 11:58 UTC

    Do you mean page content or what gets displayed to the user? The reason I ask is that pages (or sections of pages, for example a news section) could be updated via AJAX, so while the JavaScript and HTML may not have changed, what gets displayed to the user by dynamic methods may be different.

      Even without dynamic content, what's displayed to the user may change. A page may contain a picture, and the picture (but not the URL) may change (think page counters).
      The HTML: more tags get added.

        If you purely care about HTML tags, use WWW::Mechanize's content method to get the page. Then you may want to use one of the HTML parsing modules to compare tags, or, if what you're really trying to do is monitor for changes to static HTML content, store and compare $mech->content() periodically for changes.
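As a rough illustration of comparing tags with an HTML parsing module, the sketch below counts start tags with `HTML::TokeParser` (part of the HTML-Parser distribution) and reports which tag names changed in number between two snapshots. The helpers `tag_counts` and `changed_tags` are made up for this example, and counting is only one of several ways to compare tags.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Count start tags by name, e.g. { ul => 1, li => 2 }.
sub tag_counts {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new( \$html )
        or die "Cannot parse HTML";
    my %count;
    while ( my $tag = $parser->get_tag ) {
        my $name = $tag->[0];
        next if $name =~ m{^/};    # skip end tags such as "/li"
        $count{$name}++;
    }
    return \%count;
}

# List, in sorted order, the tag names whose counts differ
# between two snapshots produced by tag_counts().
sub changed_tags {
    my ($old, $new) = @_;
    my %names = map { $_ => 1 } keys %$old, keys %$new;
    my @diff  = grep { ( $old->{$_} // 0 ) != ( $new->{$_} // 0 ) }
                sort keys %names;
    return @diff;
}

# Usage: keep the counts from the previous run and compare them
# with the current page.
# my @diff = changed_tags( $saved_counts, tag_counts( $mech->content ) );
# print "Tags changed: @diff\n" if @diff;
```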