miketosh has asked for the wisdom of the Perl Monks concerning the following question:

Need some way to compare the following strings:

/home/usernames/doejohnwilson
/home/usernames/doejanemary

And output the similar parts, left anchored:
/home/usernames/doej


This is basically trying to emulate the TAB (bash and csh) or ESC-\ (ksh) autocomplete for directory names.

Ideas? Thoughts?
  • Comment on Print left-anchored similarities between two strings

Replies are listed 'Best First'.
Re: Print left-anchored similarities between two strings
by Corion (Patriarch) on Nov 10, 2009 at 15:39 UTC

    String-xor (well, rather ^) to the rescue:

    my @strings = qw( /home/usernames/doejohnwilson /home/usernames/doejanemary ); my $common = $strings[0] ^ $strings[1]; # every \x00 means the chars a +re identical my $common_length = 0; if ($common =~ /^(\x00+)/) { $common_length = length $1; }; print substr $strings[0], 0, $common_length;

      You can save a couple of steps:

      my @strings = qw( /home/usernames/doejohnwilson x/home/usernames/doejanemary ); my $common_length = ( $strings[0] ^ $strings[1] ) =~ /^\0*/ && $+[0]; print substr $strings[0], 0, $common_length;
        Awesome!

        I think this will work for me (since I can have an unspecified number of elements in an array...)

        # our $input is user inputted string # and I only want directory/ names our @matches = map { -d $_? $_ .="/" : "" } <$input*>; expand_match || print "\a"; sub expand_match { return $input = $matches[0] if(@matches == 1); # Get the longest matching string from all elements my $match = $matches[0]; for (@matches) { $match = substr $_, 0, ( $match ^ $_ ) =~ /^\0*/ && $+[0]; } return $input = $match if length($match) > length($input); return 0; }
      perldoc -f rindex
      my @strings = qw( /home/usernames/doejohnwilson /home/usernames/doejanemary ); my $common = $strings[0] ^ $strings[1]; print substr $strings[0], 0, rindex($common,"\x00"); __END__

        Careful. There can be matching characters after the first non-matching character!


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Print left-anchored similarities between two strings
by almut (Canon) on Nov 10, 2009 at 17:22 UTC

    The "standard" way to compute the longest left-anchored common substring would be to compare character for character of both strings until they differ. However, in Perl it's relatively cumbersome (compared to C for example) to access a single character in a string (via the substr builtin), so this approach is quite slow (even for rather short strings like your sample).

    For comparison, here's the results of a quick benchmark, so you'll actually appreciate the performance of the XOR method suggested by Corion:

    #!/usr/bin/perl use Benchmark qw(cmpthese); use Inline C; my $s1 = '/home/usernames/doejohnwilson'; my $s2 = '/home/usernames/doejanemary'; cmpthese(-1, { 'standard' => sub { my $pos = 0; $pos++ while substr($s1, $pos, 1) eq substr($s2, $pos, 1); my $result = substr($s1, 0, $pos); # die unless $result eq "/home/usernames/doej"; }, 'fancy-XOR' => sub { my $common = $s1 ^ $s2; my $common_length = 0; if ($common =~ /^(\x00+)/) { $common_length = length $1; }; my $result = substr($s1, 0, $common_length); # die unless $result eq "/home/usernames/doej"; }, 'standard-XS' => sub { my $common_length = left_anchored_common_substr_len($s1, $s2); my $result = substr($s1, 0, $common_length); # die unless $result eq "/home/usernames/doej"; } }); __END__ __C__ int left_anchored_common_substr_len(char* s1, char* s2) { int c = 0; while (*s1++ == *s2++) { c++; } return c; }

    Results:

    Rate standard fancy-XOR standard-XS standard 109227/s -- -77% -92% fancy-XOR 468114/s 329% -- -64% standard-XS 1291788/s 1083% 176% --

    Update: added an XS routine (via Inline::C) that implements the aforementioned "standard" approach in C.