A better solution (if pseudo-identical, as in the previous post, is not good enough) is to use Algorithm::Diff. There is a good discussion of this from merlyn at the ever-useful Web Techniques.

Here is an implementation. It returns two results because there are two answers, i.e. the similarity of A to B and of B to A, which are not necessarily the same. Note that I am pushing into arrays because your post hinted you were interested in the matches; if not, all you really need is counters. As an aside, you could apply this approach to detecting plagiarism:

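use strict;
use warnings;

$, = '|';       # output field separator: print joins its list with '|'
my $DEBUG = 1;  # set to 0 to silence the diagnostic output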

print compare( 'Hello', 'hello' ), $/;
print compare( 'Hello', 'HELLO WORLD' ), $/;
print compare( 'The quick brown fox jumped over the lazy dogs.', 
               'The quick brown dogs jumped over the lazy fox.' ), $/;
print compare( 'The quick brown fox jumped over the lazy dogs.', 
               'The quick brown fox jumped over the lazy kangaroo.' ), $/;
print compare( 'The quick brown fox jumped over the lazy dogs.', 
               'The quick brown fox jumped, tripped and broke its neck.' ), $/;


use Algorithm::Diff qw(traverse_sequences);

sub compare {
    my ( $str1, $str2 ) = @_;
    print "\nCompare '$str1' <=> '$str2'\n" if $DEBUG;
    my $tok_str1 = tokenize($str1);
    my $tok_str2 = tokenize($str2);
    my (@match,@str1, @str2);
    traverse_sequences( $tok_str1, $tok_str2, {
        MATCH => sub { push @match, $tok_str1->[$_[0]] },
        DISCARD_A => sub { push @str1, $tok_str1->[$_[0]] },
        DISCARD_B => sub { push @str2, $tok_str2->[$_[1]] },
    });
    print "'@match' '@str1' '@str2'\n" if $DEBUG;
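    # Guard against empty input, which would otherwise divide by zero.
    my $total1 = @match + @str1;
    my $total2 = @match + @str2;
    my $sim1 = $total1 ? @match / $total1 : 0;
    my $sim2 = $total2 ? @match / $total2 : 0;
    return $sim1, $sim2;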
}



sub tokenize {
    my ($str) = @_;
    # remove punctuation stuff
    $str =~ s/[^A-Za-z0-9 ]+//g;
    # lowercase
    $str = lc $str;
    # return array ref
    return [split ' ', $str];
}

__DATA__
Compare 'Hello' <=> 'hello'
'hello' '' ''
1|1|

Compare 'Hello' <=> 'HELLO WORLD'
'hello' '' 'world'
1|0.5|

Compare 'The quick brown fox jumped over the lazy dogs.' <=> 'The quick brown dogs jumped over the lazy fox.'
'the quick brown jumped over the lazy' 'fox dogs' 'dogs fox'
0.777777777777778|0.777777777777778|

Compare 'The quick brown fox jumped over the lazy dogs.' <=> 'The quick brown fox jumped over the lazy kangaroo.'
'the quick brown fox jumped over the lazy' 'dogs' 'kangaroo'
0.888888888888889|0.888888888888889|

Compare 'The quick brown fox jumped over the lazy dogs.' <=> 'The quick brown fox jumped, tripped and broke its neck.'
'the quick brown fox jumped' 'over the lazy dogs' 'tripped and broke its neck'
0.555555555555556|0.5|
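
If you only want the two scores and not the matched words, counters are enough, as mentioned above. Here is a rough sketch of that variant. The names compare_counts and tokenize_words are made up for this example and are not part of Algorithm::Diff, and scoring empty input as 0 is my own assumption rather than anything the module prescribes:

```perl
use strict;
use warnings;
use Algorithm::Diff qw(traverse_sequences);

# Hypothetical helper: same scoring as compare() in the post,
# but it only counts tokens instead of collecting them.
sub compare_counts {
    my ( $str1, $str2 ) = @_;
    my $seq1 = tokenize_words($str1);
    my $seq2 = tokenize_words($str2);
    my ( $match, $only1, $only2 ) = ( 0, 0, 0 );
    traverse_sequences( $seq1, $seq2, {
        MATCH     => sub { $match++ },   # token is part of the common subsequence
        DISCARD_A => sub { $only1++ },   # token of string 1 left out of it
        DISCARD_B => sub { $only2++ },   # token of string 2 left out of it
    });
    # Assumption: score empty input as 0 rather than dividing by zero.
    my $sim1 = ( $match + $only1 ) ? $match / ( $match + $only1 ) : 0;
    my $sim2 = ( $match + $only2 ) ? $match / ( $match + $only2 ) : 0;
    return ( $sim1, $sim2 );
}

# Same normalisation as tokenize() in the post:
# strip punctuation, lowercase, split on whitespace.
sub tokenize_words {
    my ($str) = @_;
    $str =~ s/[^A-Za-z0-9 ]+//g;
    return [ split ' ', lc $str ];
}

my ( $a_to_b, $b_to_a ) = compare_counts( 'Hello', 'HELLO WORLD' );
print "$a_to_b $b_to_a\n";    # prints "1 0.5"
```

The first score is the fraction of the first string's words that survive in the common subsequence, the second is the same for the second string, so 'Hello' is fully covered by 'HELLO WORLD' but only half the other way round.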

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

In reply to Re: calculate matching words/sentence by tachyon
in thread calculate matching words/sentence by anocelot
