
Something like this may suit you. It returns 1 if the strings are pseudo-identical and 0 if they share no words at all, with values in between rising as the similarity increases. The score is the fraction of tokens in the longer string that also appear in the shorter one. "Pseudo-identical" is the appropriate term because we don't consider word order or word frequency (where the same word appears more than once). This may or may not matter to you.

It uses just one loop and a hash table, so it should not be glacial. You can tokenize any way you like; I remove punctuation and lowercase everything.

print compare( 'Hello', 'hello' ), "\n";        # 1
print compare( 'Hello', 'HELLO WORLD' ), "\n";  # 0.5
print compare( 'The quick brown fox jumped over the lazy dogs.',
               'The quick brown dogs jumped over the lazy fox.' ), "\n";  # 1
print compare( 'The quick brown fox jumped over the lazy dogs.',
               'The quick brown dogs jumped over the lazy kangaroo.' ), "\n";  # 0.888... (8 of 9 tokens)



sub compare {
    my ( $str1, $str2 ) = @_;
    my $tok_str1 = tokenize($str1);
    my $tok_str2 = tokenize($str2);
    # swap unless @$tok_str1 contains the most tokens
    ($tok_str1, $tok_str2) = ($tok_str2, $tok_str1) if @$tok_str2 > @$tok_str1;
    # make a lookup hash from the smaller number of tokens, now in $tok_str2
    my %h;
    @h{@$tok_str2} = ();  # hash slice syntax is fastest
    # now scan str1 for these tokens and count
    my $found = 0;
    for my $tok ( @$tok_str1 ) {
        $found++ if exists $h{$tok};
    }
    # avoid dividing by zero when neither string yields any tokens
    return 0 unless @$tok_str1;
    my $similarity = $found/@$tok_str1;
  return $similarity;
}



sub tokenize {
    my ($str) = @_;
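    # remove punctuation stuff, but keep all whitespace (tabs and
    # newlines too) so that it still separates words for the split below
    $str =~ s/[^A-Za-z0-9\s]+//g;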
    # lowercase
    $str = lc $str;
    # magic whitespace split and return array ref
  return [split ' ', $str];
}
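
Since the tokenizer is the part you are most likely to change, here is a minimal sketch of one alternative. It is only an illustration: the name tokenize_unicode is made up for this example, and it rests on the assumption that you want accented letters kept, where the character class above would strip them. Rather than deleting unwanted characters and then splitting, it extracts runs of Unicode letters and digits directly.

```perl
use strict;
use warnings;

# Hypothetical drop-in alternative to tokenize() (the name is an assumption).
# Returns an array ref of lowercased tokens, like the original.
sub tokenize_unicode {
    my ($str) = @_;
    # lowercase, then collect every run of letters/digits as one token
    my @tokens = lc($str) =~ /([\p{L}\p{N}]+)/g;
    return \@tokens;
}

print join( '|', @{ tokenize_unicode('Hello, World!') } ), "\n";  # hello|world
print join( '|', @{ tokenize_unicode("Don't panic!") } ), "\n";   # don|t|panic
```

Note the difference in behaviour: punctuation inside a word now splits it ("don't" becomes "don" and "t"), whereas the original tokenizer glues the pieces together ("dont"). Which one you want depends on your data. To try it, call tokenize_unicode instead of tokenize inside compare.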

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

In reply to Re: calculate matching words/sentence by tachyon
in thread calculate matching words/sentence by anocelot
