The question is vague as to the intent and scope of the problem. The best I could surmise is that this relates to plagiarism detection. How literally do you mean "based on word content"?
One rudimentary approach is to compress the text units first separately, then together (as a "solid archive"), and compare the sizes for an estimate of the shared entropy.
Just something to toy with; there might be a better way to get at the compressed size.

#!/usr/bin/perl
use strict;
use warnings;

# A PerlIO layer that counts bytes written to it and discards them,
# so we can measure compressed output size without keeping the output.
package PerlIO::via::Count {
    sub PUSHED { bless [], shift }
    sub WRITE  { $_[0][0] += length $_[1]; length $_[1] }
    sub TELL   { $_[0][0] }
}

use IO::Compress::Lzma;

sub zce {
    # Two LZMA compressors writing into byte counters:
    # $s keeps one solid stream across all files ("together"),
    # $t starts a new stream per file ("separately").
    my ($s, $t) = map {
        open my $cnt, '>:via(Count)', \my $null;
        [ $cnt, IO::Compress::Lzma->new($cnt) ]
    } 0, 0;

    for my $file (@_) {
        local $/ = \8192;
        open my $fh, '<', $file or die "$file: $!";
        while (<$fh>) {
            $s->[1]->write($_);
            $t->[1]->write($_);
        }
        $t->[1]->newStream;
    }

    # Flush the compressors, keep just the byte counts.
    $_ = ( $_->[1]->close, $_->[0]->tell ) for ($s, $t);

    # 0% = no shared content, 100% = all files identical.
    printf "%3.0f%% %s\n", 100.0 * (1 - $s/$t) / (1 - 1/@_), join ' ', @_;
}

(@ARGV = grep -f, @ARGV) > 1 or exit;
# zce(@ARGV[$_-1,$_]) for 1..$#ARGV;   # alternative: score adjacent pairs instead
zce(@ARGV);
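For comparison (not part of the snippet above), the same idea is often applied pairwise as a "normalized compression distance": NCD(x,y) = (C(xy) - min(C(x),C(y))) / max(C(x),C(y)). Below is a minimal sketch of that, using IO::Compress::Gzip on in-memory buffers instead of LZMA; the slurping code and helper names are just for illustration.

#!/usr/bin/perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);
use List::Util qw(min max);

# Compressed size of a string, via the one-shot gzip interface.
sub csize {
    my ($data) = @_;
    gzip \$data => \my $out or die "gzip failed: $GzipError";
    return length $out;
}

# Normalized compression distance: near 0 for near-identical texts,
# approaching 1 for unrelated ones.
sub ncd {
    my ($x, $y) = @_;
    my ($cx, $cy, $cxy) = (csize($x), csize($y), csize($x . $y));
    return ($cxy - min($cx, $cy)) / max($cx, $cy);
}

@ARGV == 2 or die "usage: $0 file1 file2\n";
my ($doc1, $doc2) = map { local $/; open my $fh, '<', $_ or die "$_: $!"; scalar <$fh> } @ARGV;
printf "NCD = %.3f\n", ncd($doc1, $doc2);

This gives one distance per pair rather than the single set-wide percentage printed by the script above.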