Well, I like this anyway. Run it occasionally to see whether the content of a web page has changed significantly since you last saw it. It uses a cunning MD5 hash method: the page is split on its tags and each fragment is hashed, so content that merely moves around the page still registers as old content. Someone with a maths background might call it a set intersection, but that's their problem.
This is more reliable than using the page's last-modified time, because for generated pages that is always 'now' even when the content hasn't actually changed.
If you add sites then remove them later, they'll still be keys in the DB_File, so you might want to clean that out occasionally. I don't do it in the script as I have one running hourly, daily and weekly...
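
    use strict;
    use LWP::Simple;
    use DB_File;
    use Digest::MD5 qw(md5_hex);

    my $alert_amount = 0.1;    # fraction of unseen fragments that triggers an alert

    my @sites = (
        'http://www.herdofsheep.com',
        'http://www.theonion.com',
        'http://news.bbc.co.uk',
    );

    my $db_file = 'sites.db';
    my %sites_db;
    tie %sites_db, 'DB_File', $db_file, O_RDWR|O_CREAT
        or die "Cannot open database '$db_file': $!";

    foreach my $site (@sites) {
        # Split the page on its tags and hash each fragment between them.
        my $html = get($site);
        my %new  = map { md5_hex($_) => 1 } split(/<.*?>/, defined $html ? $html : '');
        unless (%new) { print "Could not get $site\n"; next }

        # Hashes stored on a previous run (none if the site is new to the database).
        # Declared separately: "my ... if ..." has unreliable behaviour in Perl.
        my %old;
        %old = map { $_ => 1 } split(',', $sites_db{$site}) if exists $sites_db{$site};

        my ($lines, $matches, $new_entry) = (0, 0, '');
        foreach my $hash (keys %new) {
            $lines++;
            $new_entry .= $hash . ',';
            $matches++ if $old{$hash};
        }

        # Alert, and store the new hashes, only when enough fragments are unseen.
        if ((($lines - $matches) / $lines) > $alert_amount) {
            print "$site seems to be different\n";
            $sites_db{$site} = $new_entry;
        }
    }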
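
To see why moving content around doesn't count as a change, here is a minimal offline sketch of the same comparison with no network or database involved. The helper names `fingerprint` and `changed_fraction` and the sample HTML strings are made up for illustration; they are not part of the script above.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Split a page on its tags and hash every fragment between them.
# Returns a hash ref used as a set: { md5 => 1, ... }.
sub fingerprint {
    my ($html) = @_;
    my %set = map { md5_hex($_) => 1 } split /<.*?>/, $html;
    return \%set;
}

# Fraction of fragments in the new set that were not in the old one.
sub changed_fraction {
    my ($old, $new) = @_;
    my $total = keys %$new;
    return 0 unless $total;
    my $unseen = grep { !$old->{$_} } keys %$new;
    return $unseen / $total;
}

# Hypothetical sample pages: same paragraphs reordered, then one replaced.
my $before  = fingerprint('<p>alpha</p><p>beta</p><p>gamma</p>');
my $shuffle = fingerprint('<p>gamma</p><p>alpha</p><p>beta</p>');
my $edited  = fingerprint('<p>alpha</p><p>beta</p><p>delta</p>');

printf "reordered: %.2f\n", changed_fraction($before, $shuffle);
printf "edited:    %.2f\n", changed_fraction($before, $edited);
```

The reordered page scores 0.00 because its set of hashes is identical, while the edited page scores 0.25: one fragment in four is new, counting the empty string that sits between adjacent tags as a fragment. With `$alert_amount` at 0.1, only the second would be reported.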
Re: Never see the same page twice
by artist (Parson) on Jul 26, 2003 at 05:47 UTC