http://qs1969.pair.com?node_id=78714

Segfault has asked for the wisdom of the Perl Monks concerning the following question:

This is probably a pretty newbie-ish question, but I've tried everything I can think of for this problem, so I've decided to beg for the help of the great monks. ;)

Basically, I want a function that takes two strings, compares them, and returns the percentage by which string A differs from string B. This is pretty easy if you assume the strings are the same length and simply compare character by character, but I'll be working with strings of varying lengths.

For example, if string A is "this is a really annoying piece of text" and string B is "a really annoying piece of text", a character-by-character comparison would do a very poor job of showing how similar the strings are: B is just A with the first eight characters removed, but the shift makes nearly every position mismatch.

Anyway, I was wondering what approaches might work well for this sort of comparison, so that I can get a fairly accurate idea of how one string relates to another in the project I'm working on.

Thanks in advance for any help.

Re: Comparing Strings
by no_slogan (Deacon) on May 08, 2001 at 03:26 UTC
    The String::Approx module can calculate the "edit distance" (the number of edits needed to change one string into another).
    use String::Approx 'adist';
    my $dist = adist("pattern", $input);
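    To turn an edit distance into the percentage the question asks for, one option is to divide it by the length of the longer string. The following is only a minimal sketch under that assumption: edit_distance and percent_difference are hypothetical helper names, not part of String::Approx, and the distance is a plain Levenshtein distance written out in pure Perl so the example runs without any CPAN module.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical helper: classic dynamic-programming Levenshtein distance.
# Counts single-character insertions, deletions, and substitutions.
sub edit_distance {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;

    # Row for the empty prefix of $s: turning "" into the first $j
    # characters of $t takes $j insertions.
    my @prev = (0 .. scalar(@t));

    for my $i (1 .. scalar(@s)) {
        my @cur = ($i);    # first $i chars of $s -> "" takes $i deletions
        for my $j (1 .. scalar(@t)) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            $cur[$j] = min(
                $prev[$j] + 1,            # deletion
                $cur[$j - 1] + 1,         # insertion
                $prev[$j - 1] + $cost,    # substitution (or match)
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Hypothetical helper: distance as a percentage of the longer string's
# length (an assumed normalisation; other choices are possible).
sub percent_difference {
    my ($s, $t) = @_;
    my $longest = max(length $s, length $t);
    return 0 if $longest == 0;    # two empty strings are identical
    return 100 * edit_distance($s, $t) / $longest;
}

my $a = "this is a really annoying piece of text";
my $b = "a really annoying piece of text";
printf "%.1f%%\n", percent_difference($a, $b);    # prints 20.5%
```

    For the two strings in the question the distance is 8 (the deleted "this is " prefix) out of 39 characters, hence about 20.5% different.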
Re: Comparing Strings
by runrig (Abbot) on May 08, 2001 at 03:25 UTC
Re: Comparing Strings
by ezekiel (Pilgrim) on May 08, 2001 at 03:48 UTC

    Your problem is very similar to protein sequence homology searches. A protein can be represented as nothing more than a string from an alphabet of 20 letters. Sequence homology searches (which are crucial to biology and bioinformatics) basically attempt to find and score similarities between two or more such sequences.

    Various algorithms exist to do this, e.g. Needleman and Wunsch (Journal of Molecular Biology 48, p. 443) and the ever-popular BLAST (www.ncbi.nlm.nih.gov/blast). Of course, these solutions are designed for molecular biology, and it would take a lot of work to alter them to handle general strings. My guess is you are looking for a simpler solution...
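    For a rough idea of what a Needleman-Wunsch style global alignment looks like on ordinary strings, here is a minimal sketch. The function name nw_score and the scoring values (+1 for a match, -1 for a mismatch, -1 per gap) are assumptions chosen for illustration; real sequence tools use substitution matrices and more elaborate gap penalties, and this version only returns the best score, not the alignment itself.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Assumed scoring scheme, for illustration only.
my $MATCH    =  1;
my $MISMATCH = -1;
my $GAP      = -1;

# Hypothetical helper: best global alignment score of two strings,
# computed row by row with the Needleman-Wunsch recurrence.
sub nw_score {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;

    # Aligning the empty prefix of $s with $j characters of $t costs $j gaps.
    my @prev = map { $_ * $GAP } 0 .. scalar(@t);

    for my $i (1 .. scalar(@s)) {
        my @cur = ($i * $GAP);    # $i characters of $s against nothing
        for my $j (1 .. scalar(@t)) {
            my $pair = $s[$i - 1] eq $t[$j - 1] ? $MATCH : $MISMATCH;
            $cur[$j] = max(
                $prev[$j - 1] + $pair,    # align the two characters
                $prev[$j] + $GAP,         # gap in $t
                $cur[$j - 1] + $GAP,      # gap in $s
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

print nw_score("GATTACA", "GATTACA"), "\n";    # prints 7
```

    A higher score means the strings align better; identical strings score their own length under this scheme.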

Re: Comparing Strings
by Segfault (Scribe) on May 12, 2001 at 21:11 UTC
    Thanks very much for the help. I owe you guys. ;)