Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,
I'm currently getting the date of a webpage by downloading the headers and using a regex to extract the date. Yes, it's nasty, but it works most of the time. On dynamic sites, however, it fails horribly.

Is there a correct way to get the date? I was thinking of something that parsed the metadata for 'date' or, if that failed, something that looked through the page for date-like strings and extracted them. Any ideas or modules that do this?

Replies are listed 'Best First'.
Re: date of a web page
by tachyon-II (Chaplain) on Jun 12, 2008 at 14:50 UTC

    There is nothing nasty about getting the Last-Modified time:

    my ($lm_time) = $header =~ m/Last-Modified:\s+(.+)/i;

    With dynamic pages, the concept of a last-modified time would seem, by definition, to be given by time() on your local system. Date::Parse will go hunting for strings that look like dates, but this is likely to be horribly broken if applied to web pages. Apply it to this dynamic page, for example.

    What are you actually trying to do?
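
A minimal sketch of the header approach from the reply above, combined with str2time from Date::Parse to turn the Last-Modified value into epoch seconds. The sub name last_modified_epoch and the sample header string are made up for illustration, and the regex is slightly tightened (anchored to the start of a line, stopping before the CR of a CRLF line ending), which is an assumption about what the raw headers look like:

```perl
use strict;
use warnings;
use Date::Parse qw(str2time);

# Hypothetical helper: return the Last-Modified time as epoch seconds,
# or undef if the header is missing or its value cannot be parsed.
sub last_modified_epoch {
    my ($header) = @_;

    # /m lets ^ match at the start of each header line; [^\r\n]+ stops
    # before the line ending so no stray CR ends up in the value.
    my ($lm) = $header =~ m/^Last-Modified:\s*([^\r\n]+)/im;
    return unless defined $lm;

    # str2time returns undef when it cannot make sense of the string.
    return scalar str2time($lm);
}

# Invented sample headers, as a server might send them.
my $header = "HTTP/1.1 200 OK\r\n"
           . "Last-Modified: Thu, 12 Jun 2008 14:50:00 GMT\r\n"
           . "Content-Type: text/html\r\n\r\n";

my $epoch = last_modified_epoch($header);
my $shown = defined $epoch ? scalar(gmtime $epoch) : 'no usable date';
print "$shown\n";
```

Returning undef rather than falling back to time() keeps the "no date available" case visible to the caller, who can then decide what to substitute.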

Re: date of a web page
by pc88mxer (Vicar) on Jun 12, 2008 at 14:41 UTC
    My first choice, if it isn't obvious by now :-), would be the str2time routine in Date::Parse.
Re: date of a web page
by ww (Archbishop) on Jun 12, 2008 at 23:26 UTC
    In any case, you can't count on the data you get from the site.

    Some static sites are careless about updating the metas (or the "latest update" info in the body of the page). Some dynamic sites will give you "today's date" every time you visit, despite the lack of any changes in the content.

    So, as an exercise, yours is a valid one. But if the date info has some critical meaning, be suspicious of what you get.

      Yes, I never trust anything on the net :-) This is mainly an exercise to see how well I can track news stories as they evolve over time. Mix in a bit of semantic analysis and see if the 'tone' of the stories evolves over time. Hey, it keeps me off the streets. Thanks for your suggestions, monks.
Re: date of a web page
by Anonymous Monk on Jun 12, 2008 at 14:56 UTC
    OP here. Basically, I wanted to download various webpages, and order them by their posted date. For example, following a news story that evolves over time, I'd like to order them by date. The Date::Parse module sounds promising so long as there is only one date in the written page. Any other ideas to try?
      Does the site provide RSS feeds? Those might often give you more reasonable data. How often does your script check these sites? If it checks reasonably often, you could maybe use your own time when the site lacks something better?
        No, unfortunately not all sites offer RSS yet, and even then, a lot don't contain dates. I don't want to run this all the time. Basically, I'd like to do a search on a particular topic on the site, get the results back, take those links and turn them into a date-ordered list.
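
One way the date-ordered list might be put together, as a rough sketch: parse whatever date string each page yielded with str2time, and fall back to your own fetch time when there is nothing usable, as suggested in the reply above. The record layout (url, date, fetched), the example.com URLs, and the sub names page_epoch and sort_by_date are all invented for illustration:

```perl
use strict;
use warnings;
use Date::Parse qw(str2time);

# Hypothetical records, one per downloaded page. 'date' is whatever
# date string was found (may be undef); 'fetched' is the local epoch
# time at which the page was downloaded.
my @pages = (
    { url => 'http://example.com/story-update',
      date => 'Fri, 13 Jun 2008 09:00:00 GMT', fetched => 1213400000 },
    { url => 'http://example.com/story-first',
      date => 'Thu, 12 Jun 2008 14:50:00 GMT', fetched => 1213400000 },
    { url => 'http://example.com/story-nodate',
      date => undef,                           fetched => 1213300000 },
);

# Best available epoch for a page: the parsed date if there is one,
# otherwise the time we fetched it.
sub page_epoch {
    my ($page) = @_;
    my $epoch = defined $page->{date} ? str2time($page->{date}) : undef;
    return defined $epoch ? $epoch : $page->{fetched};
}

# Oldest first. Each date is parsed once, then the list is sorted on
# the precomputed key (a Schwartzian transform).
sub sort_by_date {
    my @in = @_;
    return map  { $_->[1] }
           sort { $a->[0] <=> $b->[0] }
           map  { [ page_epoch($_), $_ ] } @in;
}

print "$_->{url}\n" for sort_by_date(@pages);
```

Note that the fetch-time fallback only gives an upper bound on a page's age, so pages without dates can land in the wrong place if the script runs rarely.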