Similarity of strings

professa has asked for the wisdom of the Perl Monks concerning the following question:

Re: Similarity of strings by ariels (Curate) on May 15, 2002 at 14:51 UTC
Use the XOR! Identical bytes XOR to `\0`, so counting the `\0` bytes in the result gives the number of matching positions: `my $x = "EKFCNSVVDLYSNLSKPFLDIVLYIFKLTSAIGAQGPA----SMMAYLVV"; my $y = "ERFCRQLSSMASKLI-------ISPFTLVYYTYQCFQSTGWPVSIFGYFI"; my $z = $x^$y; my $score = ($z =~ tr/\0//); return $score/length($x);`
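Editor's sketch of the XOR idea above, wrapped in a sub so the `return` has somewhere to return from. The name `similarity` and the equal-length check are assumptions added for illustration, not part of ariels' post. It also assumes plain byte strings that have never been used as numbers, since Perl would otherwise do a numeric XOR.

```perl
use strict;
use warnings;

# Hypothetical helper: fraction of positions at which two
# equal-length byte strings hold the same character.
sub similarity {
    my ($x, $y) = @_;
    die "strings differ in length\n" unless length($x) == length($y);
    return 0 unless length $x;

    my $z       = $x ^ $y;            # identical bytes XOR to "\0"
    my $matches = ($z =~ tr/\0//);    # empty replacement list: count only
    return $matches / length($x);
}

printf "%.2f\n", similarity("ABCD", "ABXD");    # prints 0.75
```

Three of the four positions agree, hence 0.75.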
Re: Re: Similarity of strings by professa (Beadle) on May 15, 2002 at 15:35 UTC
XOR seems to be the fastest method! It needs ~3 seconds instead of ~7 seconds using SUBSTR. Well, more than twice the speed, cool! ;-) Thanx, Micha
Re: Similarity of strings by VSarkiss (Monsignor) on May 15, 2002 at 14:52 UTC
I'm not sure if this is exactly what you're looking for, but you can find longest common subsequences with Algorithm::Diff. With that, you could find the lengths of the differing sequences and divide that by the total length. (If you have trouble understanding the documentation for Algorithm::Diff, I wrote a module review which may help.) If the code snippet you have above is an accurate description of what you're trying to calculate, it may be faster (though more memory-intensive) to split up the strings into arrays and compare an element at a time, rather than calling substr over and over. Something like this (note, this is untested): `my @ref_elems = split //, $ref_seq; my @test_elems = split //, $test_seq; my $len = length $ref_seq; my $score = 0; for (my $i = 0; $i < $len; $i++) { $score += $ref_elems[$i] eq $test_elems[$i]; }` Once you have the sequences in arrays, you can use all kinds of nifty techniques like mapcar, which can traverse both arrays in one neat statement. The top of that node has a very clear explanation of how to use it. HTH
Re: Re: Similarity of strings by professa (Beadle) on May 15, 2002 at 15:22 UTC
I tested splitting the strings up into arrays and timed the two methods (simply via 'time script.pl'). The `split` method takes ~13 seconds to finish, the `substr` method only ~7 seconds. The advantages of having the data ready in arrays don't count for me; I just need the percentage of similarity, and as fast as possible. ;-) I'll try out the rest of the suggested methods here and report which does best. Thanx, Micha
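Editor's sketch: the `substr` loop being timed is not shown in this thread, so the following is only a guess at its general shape. The sub name `similarity_substr` is hypothetical, and the code assumes both strings have the same length.

```perl
use strict;
use warnings;

# Hypothetical reconstruction of a substr-based comparison:
# walk both strings position by position and count agreements.
sub similarity_substr {
    my ($ref_seq, $test_seq) = @_;
    my $len = length $ref_seq;
    return 0 unless $len;

    my $score = 0;
    for my $i (0 .. $len - 1) {
        $score++ if substr($ref_seq, $i, 1) eq substr($test_seq, $i, 1);
    }
    return $score / $len;
}

printf "%.2f\n", similarity_substr("ABCD", "ABXD");    # prints 0.75
```

Each iteration makes two `substr` calls, which is the per-character overhead the other replies try to avoid.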
Some of the above suggestions, benchmarked (Re: Similarity of strings) by cmilfo (Hermit) on May 15, 2002 at 19:08 UTC
I tried to gather some of the ideas above and merge them into one benchmarked thing. Also, I benchmarked replacing the `split //` and `chop $var while $var` with `unpack`. Anyway, hope you enjoy.

    #!/usr/bin/perl
    use Benchmark qw(timethese cmpthese);

    my $string1 = 'This is a test of the emergency system!';
    my $string2 = 'This is a test of the emergency broadcast system!';
    my $template1 = "a1" x (length $string1);
    my $template2 = "a1" x (length $string2);

    print &xor_o, "\n", &split_o, "\n", &unpack_o, "\n", &chop_o, "\n", &chop_o2, "\n";

    my $results = timethese( 100000, {
        XOR    => \&xor_o,
        SPLIT  => \&split_o,
        UNPACK => \&unpack_o,
        CHOP   => \&chop_o,
        CHOP2  => \&chop_o2
    } );
    cmpthese($results);

    sub xor_o {
        my $xor = $string1^$string2;
        return ($xor =~ tr/\0//)/length($string1);
    }

    sub split_o {
        my @string1 = split //, $string1;
        my @string2 = split //, $string2;
        my $score = 0;
        my $length = scalar @string1;
        for (my $i = 0; $i < $length; $i++) {
            $score += ($string1[$i] eq $string2[$i]);
        }
        return $score/$length;
    }

    sub unpack_o {
        my @string1 = (unpack $template1, $string1);
        my @string2 = (unpack $template2, $string2);
        my $score = 0;
        my $length = scalar @string1;
        for (my $i = 0; $i < $length; $i++) {
            $score += ($string1[$i] eq $string2[$i]);
        }
        return $score/$length;
    }

    sub chop_o {
        my @string1 = ();
        my @string2 = ();
        my $score = 0;
        my $rstring1 = scalar reverse $string1;
        my $rstring2 = scalar reverse $string2;
        push @string1, (chop $rstring1) while $rstring1;
        push @string2, (chop $rstring2) while $rstring2;
        my $length = scalar @string1;
        for (my $i = 0; $i < $length; $i++) {
            $score += ($string1[$i] eq $string2[$i]);
        }
        return $score/$length;
    }

    sub chop_o2 {
        my $str1 = $string1;
        my $str2 = $string2;
        my $length = length $string1;
        my $score;
        $score += (chop $str1 eq chop $str2) while $str1;
        return $score/$length;
    }

Results:

    Benchmark: timing 100000 iterations of CHOP, CHOP2, SPLIT, UNPACK, XOR...
          CHOP: 43 wallclock secs (37.29 usr +  0.05 sys = 37.34 CPU) @   2678.09/s (n=100000)
         CHOP2:  9 wallclock secs ( 7.34 usr +  0.03 sys =  7.37 CPU) @  13568.52/s (n=100000)
         SPLIT: 47 wallclock secs (38.76 usr +  0.07 sys = 38.83 CPU) @   2575.33/s (n=100000)
        UNPACK: 34 wallclock secs (29.51 usr +  0.03 sys = 29.54 CPU) @   3385.24/s (n=100000)
           XOR:  2 wallclock secs ( 0.92 usr +  0.00 sys =  0.92 CPU) @ 108695.65/s (n=100000)

                 Rate  SPLIT   CHOP UNPACK  CHOP2    XOR
    SPLIT    2575/s     --    -4%   -24%   -81%   -98%
    CHOP     2678/s     4%     --   -21%   -80%   -98%
    UNPACK   3385/s    31%    26%     --   -75%   -97%
    CHOP2   13569/s   427%   407%   301%     --   -88%
    XOR    108696/s  4121%  3959%  3111%   701%     --

Update: My apologies for the `length @array` lines above. That's what I get for yanking and putting. :) Thanks to those who caught it. New benchmarks are now shown.

Update2: I've also added the chop2 implemented in the comment below.
Re: Some of the above suggestions, benchmarked (Re: Similarity of strings) by jmcnamara (Monsignor) on May 15, 2002 at 23:38 UTC
I'm glad that someone benchmarked this. However, you were a little bit unfair to the chop method. `:-)` The scalar reverse and array assignments aren't necessary. The following is 5 times faster (although still 5 times slower than the xor method): `sub chop2 { my $str1 = $string1; my $str2 = $string2; my $length = length $string1; my $score; $score += (chop $str1 eq chop $str2) while $str1; return $score/$length; }` Update: Albannach points out that because the strings in this test are not of equal length, the reverse is required. My code was based on the original sample data. Also, it is worth adding that the speed of the `xor` method is less dependent on the string length than the other methods. -- John.
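Editor's sketch of why string length matters here: `chop` takes characters from the end, so two strings of unequal length are compared right-aligned, whereas XOR (which pads the shorter operand with zero bits) compares them left-aligned. The sub names below are hypothetical and the loops use `length` rather than truthiness so that a trailing "0" does not end them early.

```perl
use strict;
use warnings;

# Count matching positions with the strings aligned at their first character.
sub matches_from_left {
    my ($s1, $s2) = @_;
    my $xor = $s1 ^ $s2;              # shorter string is zero-padded
    return ($xor =~ tr/\0//);
}

# Count matching positions with the strings aligned at their last character.
sub matches_from_right {
    my ($s1, $s2) = @_;
    my $n = 0;
    while (length $s1 and length $s2) {
        $n++ if chop($s1) eq chop($s2);
    }
    return $n;
}

print matches_from_left("abcd", "xabcd"), "\n";     # prints 0
print matches_from_right("abcd", "xabcd"), "\n";    # prints 4
```

For strings of equal length both alignments coincide, which is why the plain `chop` loop suits the original aligned sequences but not the benchmark's test strings.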
Re: Similarity of strings by Sidhekin (Priest) on May 15, 2002 at 14:48 UTC
substr() is pretty fast, but I guess this will be faster. I think :-) `return map {tr/\0/\0/ / $len} $ref_seq ^ $test_seq;` Update: If you are using the return value in scalar context, the above will yield 1, of course. Sorry about that. Try this: `my($score) = map {tr/\0// / $len} $ref_seq ^ $test_seq; return $score;` ... or even (might be faster with many short strings): `return +(map{tr/\0// / $len} $ref_seq ^ $test_seq)[0];` The Sidhekin `print "Just another Perl ${\(trickster and hacker)},"`
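Editor's sketch of the context issue mentioned in the update: a sub that returns `map` directly hands the caller an element count (always 1 here) when called in scalar context, and the actual score only in list context. The sub name `score_via_map` is hypothetical, and a temporary variable holds the XOR result.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the first one-liner above.
sub score_via_map {
    my ($ref_seq, $test_seq) = @_;
    my $len = length $ref_seq;
    my $xor = $ref_seq ^ $test_seq;
    return map { tr/\0// / $len } $xor;    # map's result depends on caller context
}

my $in_scalar = score_via_map("ABCD", "ABXD");    # scalar context: number of elements
my ($in_list) = score_via_map("ABCD", "ABXD");    # list context: the score itself

print "$in_scalar $in_list\n";    # prints "1 0.75"
```

A scalar-context result of 1 would read as 100% similarity for every pair of strings.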
Re: Re: Similarity of strings by professa (Beadle) on May 15, 2002 at 15:34 UTC
This code does not work for me: I get 100% similarity for every pair of strings compared. But I'm not bright enough at the moment to see where the bug is... Micha
Re: Similarity of strings by Beatnik (Parson) on May 15, 2002 at 14:48 UTC
You can always try Text::Soundex but you should split your strings up into smaller bits for it to be effective... `use Text::Soundex; if (soundex($string1) eq soundex($string2)) { print "Bingo!" } else { print "bummer"; }` String::Approx seems a good solution too... perhaps a better one than Text::Soundex sniff Greetz Beatnik ... Quidquid perl dictum sit, altum viditur.
Re: Similarity of strings by jmcnamara (Monsignor) on May 15, 2002 at 15:13 UTC
I'd guess that the `xor` method is probably quicker but you could benchmark these as well:

    # Method 1
    my @a1 = split //, $ref_seq;
    my @a2 = split //, $test_seq;
    $score += ($a1[$_] eq $a2[$_]) for 0 .. $#a1;

    # Method 2, Destructive
    $score += (chop $ref_seq eq chop $test_seq) while $ref_seq;

-- John.
Re: Re: Similarity of strings by professa (Beadle) on May 15, 2002 at 15:31 UTC
`chop` is a bit faster than `substr`: it needs ~5 seconds instead of ~7 seconds. Thanx, Micha
Re: Similarity of strings by Anonymous Monk on May 15, 2002 at 22:49 UTC
How did you produce your alignment? Did you use BLAST (http://www.ncbi.nlm.nih.gov/blast)? Sequence alignment tools calculate the %identity, %similarity, and %gaps of the alignments they produce. One can then use a parser (http://sapiens.wustl.edu/~ikorf/BPlite.html) to get at the information you want.
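Editor's sketch: the position-by-position methods in this thread count a gap facing a gap as a match. If gapped columns should not count, one possible convention (an assumption here, not necessarily how BLAST defines %identity) is to skip every column where either sequence has a gap. The sub name and the `-` gap character are hypothetical choices.

```perl
use strict;
use warnings;

# Hypothetical helper: identity over the columns where neither
# aligned sequence has a gap character.
sub identity_ignoring_gaps {
    my ($x, $y, $gap) = @_;
    $gap = '-' unless defined $gap;
    die "sequences differ in length\n" unless length($x) == length($y);

    my ($same, $compared) = (0, 0);
    for my $i (0 .. length($x) - 1) {
        my $cx = substr($x, $i, 1);
        my $cy = substr($y, $i, 1);
        next if $cx eq $gap or $cy eq $gap;    # skip gapped columns
        $compared++;
        $same++ if $cx eq $cy;
    }
    return $compared ? $same / $compared : 0;
}

printf "%.3f\n", identity_ignoring_gaps("AC-GA", "ACT-T");    # prints 0.667
```

Here three columns are gap-free and two of them match, giving 2/3.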
Re: Re: Similarity of strings by professa (Beadle) on May 16, 2002 at 10:31 UTC
Yep, I used BLAST, followed by mview (http://mathbio.nimr.mrc.ac.uk/~nbrown/mview/) to prepare an alignment from the BLAST report. But mview doesn't provide the similarity stuff (though it does provide e-value, score, and some other things), and I don't have the guts to parse the complete BLAST report (which can be very big sometimes) just to get the similarity. I prefer to check only those sequences which are of interest to me directly. Cheerz, Micha
Re: Re: Re: Similarity of strings by ariels (Curate) on May 16, 2002 at 10:38 UTC
Perhaps bioperl has a BLAST parser you could use? (I've never used their parsers, so I cannot comment further.)
Re: Re: Re: Re: Similarity of strings by professa (Beadle) on May 16, 2002 at 12:02 UTC

