PerlMonks  

Re: HTML Document Comparison

by merlyn (Sage)
on Sep 13, 2000 at 19:49 UTC


in reply to HTML Document Comparison

You could combine HTML::TokeParser with Algorithm::Diff to get the "number of differences", and if it's small enough, consider the files identical. Hmm. That sounds like an interesting problem... I'll put it on my WebTechniques todo pile. Thanks for the idea!

-- Randal L. Schwartz, Perl hacker

Replies are listed 'Best First'.
HTML Token Diff
by mdillon (Priest) on Sep 13, 2000 at 20:25 UTC
    how about this:
    #!/usr/bin/perl -w
    use strict;
    use Algorithm::Diff qw(diff LCS);
    use HTML::TokeParser;
    use LWP::Simple;

    sub tokenize_url {
        my $url = shift;
        my $content = get $url or die $!;
        my $p = new HTML::TokeParser(\$content);
        my (@tokens, $token);
        push @tokens, $token while (defined ($token = $p->get_token));
        \@tokens;
    }

    my @content = map { tokenize_url($_) } qw{
        http://perlmonks.org/index.pl?node_id=32285
        http://perlmonks.org/index.pl?node_id=32286
    };

    # hash tokens based on their text content
    sub hash_token {$_[0][$_[0][0] eq 'T' ? 1 : -1]}

    my @diffs = diff $content[0], $content[1], \&hash_token;
    my @LCS = LCS $content[0], $content[1], \&hash_token;

    my $largest = 0;
    for my $hunk (@diffs) {
        my (@deletions, @additions);
        for (@$hunk) {
            push @deletions, $_ if $_->[0] eq '-';
            push @additions, $_ if $_->[0] eq '+';
        }
        my $size = @deletions > @additions ? @deletions : @additions;
        $largest = $size if $size > $largest;
    }

    print scalar(@{$content[0]}), " line", (@{$content[0]} == 1 ? '' : 's'),
        " in original", $/;
    print scalar(@{$content[1]}), " line", (@{$content[1]} == 1 ? '' : 's'),
        " in revision", $/;
    print scalar(@diffs), " hunk", (@diffs == 1 ? '' : 's'), " differ", $/;
    print $largest, " line", ($largest == 1 ? '' : 's'), " in largest hunk", $/;
    printf "Revision %0.2f%% similar to original$/",
        100 * @LCS / @{$content[0]};

    updated 2001-Aug-01: small code changes; renamed from "RE: Re: HTML Document Comparison"

      Way cool! Now I can't look too hard at that if I'm going to reimplement it for the column, but way cool!

      Regarding the original poster's question, can you get some quantification of "how much" of the file has changed, like 0 to 100%?

      -- Randal L. Schwartz, Perl hacker

        so, now that i've made a few changes, i think you could do this by keeping a running total of the hunk sizes and then comparing it to the number of lines in either the original or the revision. however, i'm not really sure what would be an appropriate heuristic. perhaps showing both $total_deletions / $original_lines and $total_additions / $revision_lines.

        you could also use LCS instead of diff and compare the size of the LCS (Longest Common Subsequence) to the size of the original or revised token list. this would allow you to say roughly "Revision is 80% similar to original" if @LCS / @original == 0.8.

        i have updated my old post to include this heuristic.
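The running-total idea above (as opposed to the LCS ratio) might look roughly like the sketch below. It is only an illustration: `change_totals` and `percent` are hypothetical helper names, and the two token lists are made-up stand-ins for what `tokenize_url` would return. It assumes Algorithm::Diff is installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Algorithm::Diff qw(diff);

# Hypothetical helper: count deleted and added elements across all hunks.
sub change_totals {
    my ($original, $revision) = @_;
    my ($deletions, $additions) = (0, 0);
    for my $hunk (diff($original, $revision)) {
        for my $change (@$hunk) {
            # each change is [ '+' or '-', position, element ]
            $deletions++ if $change->[0] eq '-';
            $additions++ if $change->[0] eq '+';
        }
    }
    return ($deletions, $additions);
}

# Hypothetical helper: percentage that avoids dividing by zero.
sub percent {
    my ($part, $whole) = @_;
    return $whole ? 100 * $part / $whole : 0;
}

# Made-up token lists standing in for two tokenized pages.
my @original = qw(html head title body p /p /body /html);
my @revision = qw(html head title body h1 /h1 p /p /html);

my ($total_deletions, $total_additions) = change_totals(\@original, \@revision);
printf "%.1f%% of original deleted, %.1f%% of revision added\n",
    percent($total_deletions, scalar @original),
    percent($total_additions, scalar @revision);
```

Reporting the two ratios separately keeps a large insertion from being hidden by a small deletion; whether that is the right heuristic is still an open question, as noted above.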

        merlyn, how about some luxury features in your script as well? (I mean, there always are some.) I'm thinking of something that reports, for example, that lines 20 to 26 of document A are not in document B, even if the rest is the same. This sometimes drives me crazy when comparing plain .txt files rather than HTML. But it's just an idea.
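One possible way to get that kind of report, sketched with the same Algorithm::Diff `diff` call used earlier in the thread. `missing_ranges` is a hypothetical helper name and the two line lists are invented sample data; real use would read the lines from the files being compared.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Algorithm::Diff qw(diff);

# Hypothetical helper: return [first, last] pairs (1-based line numbers)
# for runs of lines that are in A but not in B.
sub missing_ranges {
    my ($lines_a, $lines_b) = @_;
    my @ranges;
    for my $hunk (diff($lines_a, $lines_b)) {
        # '-' entries carry the 0-based position in A; within one hunk
        # they are consecutive and in ascending order.
        my @positions = map { $_->[1] } grep { $_->[0] eq '-' } @$hunk;
        next unless @positions;
        push @ranges, [ $positions[0] + 1, $positions[-1] + 1 ];
    }
    return @ranges;
}

# Invented sample documents, one array element per line.
my @doc_a = ("alpha", "beta", "gamma", "delta", "epsilon");
my @doc_b = ("alpha", "delta", "epsilon");

for my $range (missing_ranges(\@doc_a, \@doc_b)) {
    my ($first, $last) = @$range;
    my $message = $first == $last
        ? "line $first of A is not in B"
        : "lines $first to $last of A are not in B";
    print "$message\n";
}
```

Swapping the test for `'+'` would give the ranges that are in B but not in A, with positions then referring to B.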
