http://qs1969.pair.com?node_id=281280

BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

I'm looking for algorithms and/or implementations for comparing two strings and determining if they are the same barring a single transposition of two adjacent characters.

comp( 'foo', 'ofo' ); # True comp( 'foo', 'foo' ); # False comp( 'abcde', 'bacde' ); # True comp( 'abcde', 'cabde' ); # False

If the method reports the position of the transposition as well as yes/no to the first question thats a bonus.

Efficiency is paramount.

Thanks


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
If I understand your problem, I can solve it! Of course, the same can be said for you.

Replies are listed 'Best First'.
Re: Detecting transpositions
by Abigail-II (Bishop) on Aug 06, 2003 at 08:15 UTC
    This returns -1 if the answer is negative.
    sub comp { my ($f, $s) = @_; return -1 if length ($f) != length ($s); local $_ = $f ^ $s; return -1 unless tr/\x00/1/c == 2; index $_ => "11"; }

    Abigail

      Fixed:

      sub comp { my ($f, $s) = @_; return -1 if length ($f) != length ($s); local $_ = $f ^ $s; return -1 unless tr/\x00/1/c == 2; my $i = index $_ => "11"; return -1 unless $i >= 0; return -1 unless substr($f, $i, 2) eq reverse substr($s, $i, 2); return $i; }
      -sauoq
      "My two cents aren't worth a dime.";
      
      That's cool, but it just looks for two consecutive differences, not for transposed characters. It gets confused comparing the strings "ab" and "bc", for example---it thinks they're transposed at character 0. That's probably fixable, but I'll have to think for a minute before I have a suggestion.
        The examples suggested that both strings were anagrams of each other. However, here's a fix:
        sub comp { my ($f, $s) = @_; return -1 if length ($f) != length ($s); local $_ = $f ^ $s; return -1 unless tr/\x00/1/c == 2; my $r = index $_, "11"; (substr ($f, $r, 2) eq reverse substr ($s, $r, 2)) ? $r : -1; }

        Abigail

Re: Detecting transpositions
by dws (Chancellor) on Aug 06, 2003 at 07:15 UTC
    I'm looking for algorithms and/or implementations for comparing two strings and determining if they are the same barring a single transposition of two adjacent characters.

    Have you considered either using or borrowing code from Algorithm::Diff? That's how I'd start.

      I hadn't actually considered Algorithm::Diff as I had assumed it to be line oriented. A brief scan of the docs show that the basic detection method employed is Longest Common Sequence, which isn't really applicable to my problem.

      I've also looked at String::Approx, Text::Levenshtien and Text::Soundex, but again, these aren't geared to detecting a single transposition. They are also incredibly slow if you are trying to do this on thousands of strings.

      I've also coded a simple looping comparison, and played with XOR, but I didn't want to influence the ideas. I have a vague memory of there being a clever way of doing this, but searching the web didn't elicite anything from the vague memory I have.

      So, I thought I'd ask:)


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
      If I understand your problem, I can solve it! Of course, the same can be said for you.

        You can try Text::LevenshteinXS for the need for speed.

        You can also try Text::WagnerFischer and Text::Brew to play with operation's weights/costs to detect letters swap.

        There are also Text::PhraseDistance that is suited for phrases but that can benefit from a custom distance. The 0.01 version has also a dinamic technique (slow), instead the 0.02 solves the issue for the marriage problem.

        </advertising>
Re: Detecting transpositions
by sauoq (Abbot) on Aug 06, 2003 at 08:59 UTC
    comp( 'foo', 'foo' ); # False

    So, how do you know those 'o's aren't transposed? ;-)

    Anyway, here's my attempt. I think it'll work correctly in boolean and numeric contexts. I return either undef or a string but you might prefer to return either undef or an integer (which will force you to check for undef.) Or maybe undef or an array with one element (which will work in boolean context even if the match is at the zeroth position.) Or you might want to start your positions at one... You weren't very specific about that.

    sub comp { return undef unless length $_[0] == length $_[1]; my $where; for ( 0 .. length $_[0] ) { next if substr($_[0], $_, 2) eq substr($_[1], $_, 2); if ( substr($_[0], $_, 2) eq reverse substr($_[1], $_, 2) ) { return undef if defined $where; $where = $_; } else { return undef; } } return "+$where" if defined $where; }

    I imagine there are ways to do this better. This method is limited to N comparisons of 2 character strings for successful comparisons and less on failures. It might fare pretty well if you expect failures most of the time. What kind of input do you expect?

    -sauoq
    "My two cents aren't worth a dime.";
    

      Okay. The strings will always be the same length (should have metioned that. Saves a test).

      The strings will always be anagrams of each other, as implied but not stated.

      The 'characters' of each string represent a sequence of states in the order that they are to be transitioned through.

      The bigger picture is that I have a set of strings--always much less than the potential set of anagrams, but still potentially quite large--each describing an allowable sequences of transitions.

      The task is to try and find a single sequence (if possible) of strings (sequences), that allows me to move through each of the sequence of states with only one pair of states (chars) changing at each step. Rather like Grey codes, but using bytes instead of chars.

      That requires picking a starting sequence (string), then comparing that against each of the other strings looking for one that can be achieved with a single transposition. Then using that as the base, compare that against each of the remaining strings looking for the next 'match' and so on until the sequence of strings is complete.

      If I reach a point where there is no next string, backtrack and look for a second match at the previous level, and move forward again. If all attempts to complete the sequence fail, put the original start point at the end and start again from the new start point and repeat until a sequence is found or all possiblilities have been attempted. I think that makes it an O( n * n! ) search?, hence the need for the comparison to be as fast as possible. I think Abigails original attempt will suffice for my needs and is probably the quickest, though I haven't run the bbenchmark yet. I will once I have generated some realistic sets of test data.

      I think Abigail-IIs Backtracking through the regex world technique of using the regex engine to control the backtracking might be applicable here, and I was hoping for a regex solution to the comparision problem. Rather than using a code block, I thought that I might be able to use the (?(condition)Yespattern|No patterns) to decide when to backtrack or not, but I couldn't think of a regex. I can still try it with a code block, but for now it's looking more like a job for a recursive function.

      Too many words, but thats the best description I can come up with for the picture in my head. I don't have any code that comes close yet, and showing the strings would be useless as the states are represented by char values < 32, which would just display as garbage. I could possibly change that to use [0-9A-Z] but skipping over the punctuation chars would be a pain.


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
      If I understand your problem, I can solve it! Of course, the same can be said for you.

        That sounds like O (n * n!) indeed. It looks like you are trying to find a hamiltonian path through a graph where the nodes of the graph correspond to the string in your set, and there's an edge between a pair of nodes if, and only if, the corresponding strings differ by a single transposition.

        The general problem of finding an hamiltonian path is NP-complete, but for specific graphs it might be easier.

        Abigail

Re: Detecting transpositions
by sgifford (Prior) on Aug 06, 2003 at 07:52 UTC
    At first blush, this seems to work. Not sure about the performance; I'll poke at it a bit more. It returns the position of the match, which can be 0, so you have to test if the result is defined.
    sub comp { my($s1,$s2)=@_; my @s1 = split('',$s1); my @s2 = split('',$s2); my $transpose; for (my $i=0;$i<@s1;$i++) { if ($s1[$i] ne $s2[$i]) { if ($transpose) { return undef; } # Uh-oh; they differ. Is this a transposition? if ( ($s1[$i+1] eq $s2[$i]) && ($s1[$i] eq $s2[$i+1])) { $transpose = $i; $i++; } } } return $transpose; }
Re: Detecting transpositions
by sgifford (Prior) on Aug 06, 2003 at 08:48 UTC
    Here's another version that uses String::DiffLine. It's about 3 times faster than my previous post, according to Benchmark.
    sub comp2 { return undef if (length($_[0]) != length($_[1])); my $i = diffline($_[0],$_[1]); if ( ($i < length($_[0])) && (substr($_[0],$i+1,1) eq substr($_[1],$i,1)) && (substr($_[0],$i,1) eq substr($_[1],$i+1,1)) && (substr($_[0],$i+2) eq substr($_[1],$i+2)) ) { return $i; } return undef; }

    String::DiffLine needs some tweaking to compile (change sv_undef to PL_sv_undef in the .xs file), but it seems to be quite fast.

Re: Detecting transpositions
by dws (Chancellor) on Aug 06, 2003 at 16:30 UTC
    Efficiency is paramount.

    Have you considered precomputing transpositions?

    Then, comp(A, B) becomes

    • lookup A' (the set of transpositions for A)
    • determine if B is a member of A'

    If everything is in a big hash (i.e., if you've traded space for time), the lookup is quick. You could even do a lazy initialization of the transposition set.

      That really helps. Kind of like a lazy evaluating ST.

      The problem remains O9n*n!), but for n-1 * n! passes the comp() comes down to

      return $cache{$_[0]}{$_[1]} if exists $cache{$_[0]}{$_[1]};

      Which is a huge time saver (provided I don't run out of memory:).

      Thanks.


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
      If I understand your problem, I can solve it! Of course, the same can be said for you.

      shouldn't you sum up the weight of every character first to compare the result and know if they have the same ones?
Re: Detecting transpositions
by Skeeve (Parson) on Aug 06, 2003 at 08:28 UTC
    print comp( 'foo', 'ofo' ) ? "True\n" : "False\n"; print comp( 'foo', 'foo' ) ? "True\n" : "False\n"; print comp( 'abcde', 'bacde' ) ? "True\n" : "False\n"; print comp( 'abcde', 'cabde' ) ? "True\n" : "False\n"; sub comp { my($x,$y)= @_; my($xreg)=join('|', grep $_,map {$1.$3.$2.$4 if $x=~ /^(.{$_})(.)(.) +(.*)/ && $2 ne $3} 0..length $x); return $y=~ /^$xreg$/; }
    Update: Forget it ;-/ Abigail's is MUCH better! ++ for her him

    Update: Thanks to dragonchild and perlguy for telling me that Abigail-II is a he ;-)