Well, I like this anyway. Run it occasionally to see if the content of a web page has changed significantly since you last saw it. It uses a cunning MD5 hash method so that content moving around the page still gets registered as old content. Someone with a maths background might call it a set intersection, but that's their problem.

This is more reliable than using the last modified time of the page: for generated stuff that is always 'now', even though the content might not actually have changed.
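The hashing idea can be tried on its own, without fetching anything. This is only a minimal sketch: the helper names fingerprint and changed_fraction and the sample HTML strings are made up for illustration, and it assumes the page is split on its tags the same way the script does.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Split a page on its tags and hash each chunk of text.
# Returns a hash ref whose keys are the chunk digests (a set).
# (fingerprint is a hypothetical helper name, not part of the script.)
sub fingerprint {
    my ($html) = @_;
    my %set = map { md5_hex($_) => 1 } split(/<.*?>/, $html);
    return \%set;
}

# Fraction of chunks in $new that were not present in $old.
sub changed_fraction {
    my ($old, $new) = @_;
    my $total = keys %$new;
    return 0 unless $total;
    my $seen = grep { $old->{$_} } keys %$new;
    return ($total - $seen) / $total;
}

# Made-up sample pages: same content reordered, then one chunk edited.
my $before  = fingerprint('<p>alpha</p><p>beta</p>');
my $shuffle = fingerprint('<p>beta</p><p>alpha</p>');
my $edited  = fingerprint('<p>alpha</p><p>gamma</p>');

printf "reordered: %.2f\n", changed_fraction($before, $shuffle);
printf "edited:    %.2f\n", changed_fraction($before, $edited);
```

The reordered page comes out as unchanged because only the set of digests matters, not their order. The edited page shows a non-zero fraction, which is what gets compared against the alert threshold.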

If you add sites then remove them later, they'll still be keys in the DB_File, so you might want to clean that out occasionally. I don't do it in the script as I have one running hourly, daily and weekly...
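For the occasional clean-out, something along these lines might do. It is a sketch under assumptions: prune_sites is a hypothetical helper, and a plain hash with made-up values stands in for the tied DB_File hash so the example runs anywhere. A hash tied to sites.db should be usable in its place.

```perl
use strict;
use warnings;

# Delete every key of %$db that is not in the list of sites to keep.
# Returns the keys that were removed.
# (prune_sites is a hypothetical helper, not part of the script.)
sub prune_sites {
    my ($db, @keep) = @_;
    my %keep  = map { $_ => 1 } @keep;
    my @stale = grep { !$keep{$_} } keys %$db;
    delete $db->{$_} for @stale;
    return @stale;
}

# Stand-in for the tied hash; the second URL and both values are made up.
my %sites_db = (
    'http://www.herdofsheep.com' => 'aaa,bbb,',
    'http://example.com/retired' => 'ccc,',
);

my @removed = prune_sites(\%sites_db, 'http://www.herdofsheep.com');
print "removed: @removed\n";
```

If several copies of the script share one database with different site lists, pass the combined list of all of them, otherwise one run would throw away keys another still needs.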

use strict;
use LWP::Simple;
use DB_File;
use Digest::MD5 qw(md5_hex);

my $alert_amount = 0.1;    # fraction of new chunks that triggers an alert

my @sites = (
    'http://www.herdofsheep.com',
    'http://www.theonion.com',
    'http://news.bbc.co.uk',
);

my $db_file = 'sites.db';
my %sites_db;
tie %sites_db, 'DB_File', $db_file, O_RDWR|O_CREAT
    or die "Cannot open database '$db_file': $!";

foreach my $site (@sites) {
    # Split the page on its tags and hash each chunk of text.
    my $page = get($site);
    my %new  = map { md5_hex($_) => 1 }
               split(/<.*?>/, defined $page ? $page : '');
    unless (%new) { print "Could not get $site\n"; next }

    # Hashes stored the last time this site raised an alert, if any.
    my %old;
    %old = map { $_ => 1 } split(/,/, $sites_db{$site})
        if exists $sites_db{$site};

    my ($lines, $matches, $new_entry) = (0, 0, '');
    foreach my $hash (keys %new) {
        $lines++;
        $new_entry .= $hash . ',';
        $matches++ if $old{$hash};
    }

    # Alert, and remember the new hashes, if enough chunks are unseen.
    if (($lines - $matches) / $lines > $alert_amount) {
        print "$site seems to be different\n";
        $sites_db{$site} = $new_entry;
    }
}

In reply to Never see the same page twice by quidity
