Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Quantitative Change instead of Boolean

by titivillus (Scribe)
on Jun 29, 2006 at 04:22 UTC ( [id://558209]=perlquestion: print w/replies, xml ) Need Help??

titivillus has asked for the wisdom of the Perl Monks concerning the following question:

I have a perl app that spies for me. It looks regularly at certain web pages and tells me if there has been a change. This is done in a fairly easily by taking an MD5 hash of the text of a web page and comparing it with a shared hash saved in a DBM file. (Minus a certain amount of fudging to make sure things like revolving web ads don't mess me up.) Within those constraints, it's a boolean: was there a change? Y/N.

What I'm now thinking of is quantitative change. I'm hoping to find a way to say how much or how little things have changed, preferrably without saving the whole text as a value in a DBM file. Yes, I know, disk is cheap, but it seems like I'm being insufficiently clever if I accept that that's the only way to do it.

.sig goes here

Replies are listed 'Best First'.
Re: Quantitative Change instead of Boolean
by Corion (Patriarch) on Jun 29, 2006 at 07:14 UTC

    One not-so crude method of establishing the similarity between texts was presented to me in String::Trigram, which was used by its author for just that, determining how similar two webpages are. The method is still crude as it doesn't discriminate between HTML and "real" text, but the author claimed it worked well enough for him because minor differences like timestamps etc. made only a small change in the resulting number.

      Still crude, not-so-crude... it just depends on how you twiddle and tweak your ideas - at the beginning often crude or highly volatile - to which ends... ,)

      cheers,
      --shmem

      _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                    /\_¯/(q    /
      ----------------------------  \__(m.====·.(_("always off the crowd"))."·
      ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: Quantitative Change instead of Boolean
by shmem (Chancellor) on Jun 29, 2006 at 06:36 UTC
    Sounds to me like you're thinking about something like a "web diff". The problem here is, the rules are not well established as for, say, diff and patch, and telling big from minor changes is not easy either. E.g. you could move block level elements around in the source and see no difference in rendering, the rules for that being in a css.

    If you are interested in text only, it's easy I think. Text in normal files breaks at line ends, which is not the case in HTML. I'd suggest stripping the text from the HTML (with e.g. Tom Christiansens striphtml) removing empty lines and leading/trailing whitespace, jam it together and break it again at punctuation. The text between punctuations are your lines then, which you could run through diff.

    If you are interested in markup changes/layout, you could compile the page into a DOM tree (e.g. with HTML::Tree), and compare it's content starting from the twigs.

    Just some ideas...

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
      Sounds to me like you're thinking about something like a "web diff".
      Not really. Or maybe "yes, but...". A web diff, fully speaking, should point out the changes. I'm hoping to quantify the difference between a small change, like "18 comments" instead of "3 comments", and a big change, like ... new blog posts, (Yes, RSS. I do that too.) or anything else that's more significant than a spelling change or a raised digit. But thanks for the help.

      .sig goes here

Re: Quantitative Change instead of Boolean
by esskar (Deacon) on Jun 29, 2006 at 05:55 UTC
    just for me to understand:

    you said "Minus a certain amount of fudging to make sure things like revolving web ads don't mess me up."

    how do you do this using MD5. i mean, changing one bit in the data will result in a totally different MD5 digest, won't it?
      He always removes the ads before using MD5. If the only difference between the page and its previously downloaded version are the ads, the MD5 digest will match. The pages aren't being compared; the pages without the ads are being compared.
Re: Quantitative Change instead of Boolean
by Eimi Metamorphoumai (Deacon) on Jun 30, 2006 at 15:26 UTC
    I'd suggest you look at Digest::Nilsimsa, which should be a pretty good starting point. It creates something like a hash, but set up so that "similar" text produces similar hashes.
      I've taken a look at Digest::Nilsimsa. It is a cool thing. However, I've found a problem.
      for my $d ( 30 .. 36 ) { my $this = $nil->text2digest( ( 'a' x $d ) . 'b' ) ; my $that = $nil->text2digest( ( 'a' x ( $d + 1 ) ) ) ; print nilcomp( $this , $that ) ; } sub nilcomp { my $diff = 0 ; my $diff2 = 0 ; my @this = split /|/ , shift ; my @that = split /|/ , shift ; for my $a ( 0 .. scalar(@this)-1 ) { $diff++ if $this[$a] ne $that[$a] ; my $is = hex $this[$a] ; my $at = hex $that[$a] ; if ( $is != $at ) { $diff2 += abs $is - $at ; } } return ( join "" , @this) . qq(\n) . ( join "" , @that) . qq(\n) . $diff . qq( characters different\n) . ( abs $diff2 ) . qq( bits different\n\n); }
      gives you
      000000000000900000010021000008105000080010000004000c400000000008 0000000000009000000000200000080040000000000000040008400000000000 8 characters different 25 bits different 000000000000900000010021000008105000080010000004000c400000000008 0000000000009000000000200000080040000000000000040008400000000000 8 characters different 25 bits different 000000000000900000010021000008105000080010000004000c400000000008 0000000000009000000000200000080040000000000000040008400000000000 8 characters different 25 bits different 000000000000900000010021000008105000080010000004000c400000000008 0000000000009000000000200000080040000000000000040008400000000000 8 characters different 25 bits different 000000000000900000010021000008105000080010000004000c400000000008 0000000000009000000000200000080040000000000000040008400000000000 8 characters different 25 bits different 0000000000009000000000200000080040000000000000040008400000000000 0000000000009000000000200000080040000000000000040008400000000000 0 characters different 0 bits different 0000000000009000000000200000080040000000000000040008400000000000 0000000000009000000000200000080040000000000000040008400000000000 0 characters different 0 bits different
      If all I was trying to use this on were 35-character data sets, that'd be cool, but I'm trying to run this on whole web pages. I pull out all markup and whitespace, I'll still be in the headers by the time the 35th character rolled around. I love it in theory, but in practice, the data's too big for the module. So, I could do this:  $output =~ s[(.{35})][$nilsimsa->text2digest($1)]ge ; or something of the sort, but that seems ... goofy. But it does point out that taking length $output and comparing it to last time should indicate a small change, if they're only a small number of characters apart.

      .sig goes here

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://558209]
Approved by spiritway
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (6)
As of 2024-03-28 19:20 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found