xjar has asked for the wisdom of the Perl Monks concerning the following question:
Hello all. I need to write a program to compare two HTML documents to determine if they are similar enough to be considered "the same". What I was thinking of doing is this (keep in mind, I'm a neophyte, so if my ideas are pretty poor, be kind):
Read each document into an array, line by line
Strip the newline off of each array element
"Concatenate" each array element into a string variable, so that in the end, each variable will hold an entire document
Take a substr() of each variable, say 150 characters in, and then take 100 characters from there. If the two are the same, then the documents are the same.
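A minimal sketch of the four steps above, for illustration only. The names `flatten_file` and `same_by_sample` are hypothetical; the offset of 150 and length of 100 used in the usage note come from the last step.

```perl
use strict;
use warnings;

# Steps 1-3: read a file line by line, strip each newline, and join
# the lines into a single string holding the whole document.
sub flatten_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!\n";
    chomp(my @lines = <$fh>);
    close $fh;
    return join '', @lines;
}

# Step 4: compare one fixed window of each document.
# Returns 1 if the windows are identical, 0 otherwise (including when
# either document is too short to reach the window).
sub same_by_sample {
    my ($doc_a, $doc_b, $offset, $length) = @_;
    return 0 if $offset >= length($doc_a) || $offset >= length($doc_b);
    return substr($doc_a, $offset, $length) eq substr($doc_b, $offset, $length) ? 1 : 0;
}
```

Usage would look something like `same_by_sample(flatten_file('old.html'), flatten_file('new.html'), 150, 100)`, where the two file names are placeholders. Note that two documents that differ only outside the sampled window would still be reported as the same.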
Now, I'm not sure how efficient this will be, especially with the swapping from array to variable. Can anyone provide me with some ideas, or even (hehe) a module that can help with this?
Much thanks, xjar
Re: HTML Document Comparison
by merlyn (Sage) on Sep 13, 2000 at 19:49 UTC
#!/usr/bin/perl -w
use strict;
use Algorithm::Diff qw(diff LCS);
use HTML::TokeParser;
use LWP::Simple;
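sub tokenize_url
{
    my $url = shift;
    # LWP::Simple's get() returns undef on failure and does not set $!
    my $content = get($url);
    defined $content or die "Could not fetch $url\n";
    my $p = HTML::TokeParser->new(\$content);
    my (@tokens, $token);
    push @tokens, $token
        while (defined ($token = $p->get_token));
    \@tokens;
}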
# fetch and tokenize both documents
my @content = map { tokenize_url($_) } qw{
    http://perlmonks.org/index.pl?node_id=32285
    http://perlmonks.org/index.pl?node_id=32286
};
# hash tokens based on their text content: text ('T') tokens keep their
# text in element 1; every other token type keeps its original source
# text in the last element
sub hash_token {$_[0][$_[0][0] eq 'T' ? 1 : -1]}
my @diffs = diff $content[0], $content[1], \&hash_token;
my @LCS = LCS $content[0], $content[1], \&hash_token;
my $largest = 0;
for my $hunk (@diffs)
{
    my (@deletions, @additions);
    for (@$hunk)
    {
        push @deletions, $_ if $_->[0] eq '-';
        push @additions, $_ if $_->[0] eq '+';
    }
    my $size = @deletions > @additions ? @deletions : @additions;
    $largest = $size if $size > $largest;
}
# the sequences being compared are HTML tokens, not lines
print scalar(@{$content[0]}), " token",
    (@{$content[0]} == 1 ? '' : 's'), " in original", $/;
print scalar(@{$content[1]}), " token",
    (@{$content[1]} == 1 ? '' : 's'), " in revision", $/;
print scalar(@diffs), " hunk", (@diffs == 1 ? '' : 's'),
    " differ", $/;
print $largest, " token", ($largest == 1 ? '' : 's'),
    " in largest hunk", $/;
printf "Revision %0.2f%% similar to original$/",
    100 * @LCS / @{$content[0]};
updated 2001-Aug-01: small code changes; renamed from "RE: Re: HTML Document Comparison"
Way cool! Now I can't look too hard at that if I'm going to reimplement it for
the column, but way cool!
Regarding the original poster's question, can you get some quantification of
"how much" of the file is changed, like 0 to 100%?
-- Randal L. Schwartz, Perl hacker
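On the 0-to-100% question: the script above gets its final figure by dividing the length of the longest common subsequence (LCS) by the number of tokens in the original. The sketch below shows that arithmetic on plain lists of strings. It is only an illustration: `lcs_length` and `similarity_percent` are hypothetical names, and the hand-rolled dynamic-programming LCS stands in for Algorithm::Diff's `LCS`, which the script above actually uses.

```perl
use strict;
use warnings;

# Length of the longest common subsequence of two lists of strings,
# computed row by row so only two rows are kept in memory.
sub lcs_length {
    my ($x, $y) = @_;
    my @prev = (0) x (scalar(@$y) + 1);
    for my $i (1 .. scalar(@$x)) {
        my @curr = (0);
        for my $j (1 .. scalar(@$y)) {
            if ($x->[$i - 1] eq $y->[$j - 1]) {
                $curr[$j] = $prev[$j - 1] + 1;
            }
            else {
                $curr[$j] = $prev[$j] > $curr[$j - 1] ? $prev[$j] : $curr[$j - 1];
            }
        }
        @prev = @curr;
    }
    return $prev[-1];
}

# Percentage of the original's elements that survive, in order, in the
# revision. An empty original is reported as 0 to avoid dividing by zero.
sub similarity_percent {
    my ($original, $revision) = @_;
    return 0 unless @$original;
    return 100 * lcs_length($original, $revision) / @$original;
}

my @original = qw(<html> <body> Hello world </body> </html>);
my @revision = qw(<html> <body> Goodbye world </body> </html>);
printf "%.2f%% similar\n", similarity_percent(\@original, \@revision);
```

Because the figure is relative to the original, swapping the two arguments can give a different percentage when the documents differ in length.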
Re: HTML Document Comparison
by moen (Hermit) on Sep 14, 2000 at 01:37 UTC
k, newbie answer ahead.
Something like this gave me the proper answer (0 or 1) for equality using String::Approx. Of course, it does no validation against proper HTML or anything like that; it's just a plain text match.
use String::Approx 'amatch';
my $match = amatch($txt1, $txt2);
where $txt1 holds the original document as a single string and $txt2 holds the document being compared. $match will then be true or false depending on whether they approximately match.
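For readers who want to see what an "approximately equal" yes/no answer can mean, here is a hedged sketch that does not use String::Approx at all. It swaps in a plain Levenshtein edit distance over whole strings, which is a different measure from what `amatch` computes. The names `edit_distance` and `approx_same`, and the percentage threshold parameter, are assumptions made for this illustration.

```perl
use strict;
use warnings;

# Levenshtein distance: the minimum number of single-character
# insertions, deletions, and substitutions turning $s into $t.
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @curr = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            my $best = $prev[$j - 1] + $cost;                            # substitute or keep
            $best = $prev[$j] + 1     if $prev[$j] + 1     < $best;      # delete
            $best = $curr[$j - 1] + 1 if $curr[$j - 1] + 1 < $best;      # insert
            $curr[$j] = $best;
        }
        @prev = @curr;
    }
    return $prev[-1];
}

# True (1) when the distance is at most $max_percent of the longer string.
sub approx_same {
    my ($s, $t, $max_percent) = @_;
    my $longest = length($s) > length($t) ? length($s) : length($t);
    return edit_distance($s, $t) <= $longest * $max_percent / 100 ? 1 : 0;
}

print approx_same("kitten", "sitten", 20) ? "same\n" : "different\n";
```

The running time grows with the product of the two string lengths, so this is only practical for fairly small documents.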
Re: HTML Document Comparison
by xjar (Pilgrim) on Sep 13, 2000 at 22:00 UTC
Thank you both, merlyn and mdillon, for the help... Heh, this is one of the reasons I love Perl Monks so much!
With some minor modification, mdillon's code looks like it might be just what I was looking for, and it seems far more efficient than what I proposed.
Much thanks, xjar
Re: HTML Document Comparison
by cbraga (Pilgrim) on Sep 14, 2000 at 01:25 UTC
How about this:
1. Read the document and strip off everything except the html tags, including all newlines;
2. Take the MD5 hash of the tag structure;
3. Compare the MD5s of the documents to determine whether they have the same structure, or are derived from the same template with different text.
Advantages:
* Accounts for markup changes, so differences in the text are not significant;
* If you have a lot of documents, MD5s are easy and quick to compare, as opposed to whole documents.
Disadvantages:
* Accounts for markup changes, so differences in the text are not significant;
* MD5s are very strict, so there's no telling between small and big differences.
Actually, this isn't my idea; it was used in some web survey to count the number of unique sites on the net.
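A rough sketch of the three steps, assuming that "structure" means the sequence of tag names only (attributes are dropped, which is my assumption, not something stated above). The names `tag_skeleton` and `structure_digest` are hypothetical, and the regex is a crude stand-in for a real parser such as the HTML::TokeParser used earlier in this thread: it will be fooled by tag-like text inside comments or scripts.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Step 1: keep only the tag names, in document order, lowercased.
# Closing tags keep their leading slash, e.g. "p /p".
sub tag_skeleton {
    my ($html) = @_;
    my @tags = $html =~ m{<(/?[A-Za-z][A-Za-z0-9]*)}g;
    return join ' ', map { lc } @tags;
}

# Step 2: MD5 of the tag skeleton.
sub structure_digest {
    my ($html) = @_;
    return md5_hex(tag_skeleton($html));
}

# Step 3: compare digests instead of whole documents.
my $page_a = '<html><body><p>Hello</p></body></html>';
my $page_b = "<HTML>\n<body><p>Something else entirely</p>\n</body></html>";
print structure_digest($page_a) eq structure_digest($page_b)
    ? "same structure\n"
    : "different structure\n";
```

As the disadvantages list says, the comparison is all or nothing: one extra tag anywhere produces a completely different digest.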
Yah NetCraft.
Which is a great idea, EXCEPT that the changes may be
text body within a template. They were looking for templated
documents specifically.
Still, a cool idea worthy of the ++ I stuck on it =)
--
$you = new YOU;
honk() if $you->love(perl)
Re: HTML Document Comparison
by planetscape (Chancellor) on Mar 22, 2008 at 21:07 UTC
Re: HTML Document Comparison
by ww (Archbishop) on Mar 22, 2008 at 21:47 UTC
This is a little late for OP, probably, but since planetscape has cross-referenced this thread with a more recent one, what follows may still have some value for future readers.
"...to determine if they are similar enough to be considered 'the same'"
There are some great answers, above, but note that some of them touch on what seems to me to be the threshold problem. To be explicit:
What are the criteria for sameness? For one example, are two pages "to be considered the same" if one presents a given data set as a piechart and another as a barchart? ...in tabular fashion rather than as a block of lines?
More generally, are we trying to see if:
- the two documents render the same content with mere typographic differences occurring solely because of variant markup? (and what makes those differences "mere?")
- the content is different, but the general appearance (layout) is the same or similar? (and for what value of "similar?")
If you sample only segments of the comparatives, how much risk of a false positive are you willing to accept?
Much as I admire the posts above, I think you need to answer these (and similar) questions for your project before adopting a method.