String matching idea

baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

This is a post regarding potential ideas so no code required. I am just trying to see if I forgot some obvious solution.

Problem:
Given two strings with the same prefix and the same length:

aaababbbababbbabababb
aaababbbabaaababbabba

What would be the fastest way (the fewest computational steps) to identify the length of the prefix shared by two strings? An obvious solution is to compare the characters pairwise until a mismatch is found. But is there a way to preprocess these strings in order to reduce the number of pairwise comparisons? Also, given a large number of such cases, what would be a better solution than pairwise comparison of the strings? Any suggestion is more than welcome (code not required).
thnx
baxy

Re: String matching idea by choroba (Cardinal) on Sep 18, 2015 at 12:06 UTC
My idea: xor the strings, then find the position of the first non-null character in the result. Update: no code was requested, so the code is behind a spoiler tag. If your strings only contain "a" and "b", you can use [spoiler]. لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: String matching idea by Tux (Canon) on Sep 18, 2015 at 12:24 UTC
Longest Common Substring, String::LCSS, LCSS (LCSS perl), LCSS, and many, many more. Google for `perl find longest common substring`. Enjoy, Have FUN! H.Merijn
Re^2: String matching idea by salva (Canon) on Sep 18, 2015 at 12:31 UTC
Note that the OP is asking for the common prefix, a much simpler problem than the LCSS one!
Re^3: String matching idea by Tux (Canon) on Sep 18, 2015 at 13:41 UTC
I agree, but all those show a wealth of approaches. Finding the one that needs the fewest ops for the OP's specific case on their system is left to the reader. In some cases some ops might be faster on one architecture than on another. Enjoy, Have FUN! H.Merijn
Re: String matching idea by hippo (Archbishop) on Sep 18, 2015 at 12:11 UTC
My first approach for 2 strings would be a binary search. If each string is of length L, start by comparing the substrings from 0 to L/2. Then, depending on whether they match, compare the substrings up to either 3L/4 or L/4, and so on. This should give the correct result in about log_2(L) steps, but bear in mind that each step requires 2 substring ops, which may not be particularly cheap. For a large number of pairs of strings consisting of only the 2 chars "a" and "b", I'd think about assigning them to hash bins based on their initial characters. Each set you create should halve the number of other operations.
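A minimal sketch of the binary search described above, assuming plain Perl strings. The name common_prefix_len_bsearch is made up for illustration. Each step compares only the slice that has not been verified yet, so the number of Perl-level comparisons is about log_2(L), although each eq still walks its slice character by character internally.

```perl
use strict;
use warnings;

# Hypothetical helper: length of the prefix shared by two strings,
# found by binary search on the candidate prefix length.
sub common_prefix_len_bsearch {
    my ($s, $t) = @_;
    my $lo = 0;
    my $hi = length($s) < length($t) ? length($s) : length($t);

    # Invariant: the first $lo characters match; the answer is <= $hi.
    while ($lo < $hi) {
        my $mid = int(($lo + $hi + 1) / 2);
        # Only the unverified slice [$lo, $mid) needs comparing.
        if (substr($s, $lo, $mid - $lo) eq substr($t, $lo, $mid - $lo)) {
            $lo = $mid;        # a prefix of length $mid matches
        }
        else {
            $hi = $mid - 1;    # the first mismatch lies before position $mid
        }
    }
    return $lo;
}

print common_prefix_len_bsearch('aaababbbababbbabababb',
                                'aaababbbabaaababbabba'), "\n";   # prints 11
```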
Re: String matching idea by salva (Canon) on Sep 18, 2015 at 12:29 UTC
For the 1-to-1 case, just do the obvious thing; there are no shortcuts for that problem. For the N-to-N case, use a trie or prefix tree.
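For the N-to-N case, a trie could be sketched with nested hashes, one level per character. This is only an illustration under that assumption; trie_insert and trie_prefix_len are hypothetical names, not functions from any module. Once the strings are stored, the longest prefix a new string shares with any of them is found in a single walk from the root.

```perl
use strict;
use warnings;

# Hypothetical helper: store a string in a nested-hash trie.
sub trie_insert {
    my ($trie, $str) = @_;
    my $node = $trie;
    for my $ch (split //, $str) {
        $node->{$ch} //= {};      # create the child node on first use
        $node = $node->{$ch};
    }
    return;
}

# Hypothetical helper: length of the longest prefix that $str shares
# with any string already stored in the trie.
sub trie_prefix_len {
    my ($trie, $str) = @_;
    my ($node, $len) = ($trie, 0);
    for my $ch (split //, $str) {
        last unless exists $node->{$ch};
        $node = $node->{$ch};
        $len++;
    }
    return $len;
}

my %trie;
trie_insert(\%trie, $_) for qw(
    aaababbbababbbabababb
    aaababababaaababbabba
    aaababbbabaaaaabbabba
);
print trie_prefix_len(\%trie, 'aaababbbabaaababbabba'), "\n";   # prints 13
```

Building the trie costs one step per stored character, so it only pays off when many lookups are made against the same set of strings.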
Re^2: String matching idea by Anonymous Monk on Sep 18, 2015 at 18:32 UTC
Please indulge my curiosity - what do you consider "the obvious thing" ?
Re^3: String matching idea by salva (Canon) on Sep 19, 2015 at 07:06 UTC
Oh, well, I meant doing the obvious thing in C (or any other sufficiently low-level language): comparing the characters one by one, sequentially, until a divergence is found. If you limit the solution to Perl, which doesn't provide a builtin that could do that, we go into the land of tricks, such as the solution posted by choroba above.
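For comparison, the sequential scan can also be written directly in Perl. This is just a sketch (common_prefix_len_scan is a made-up name), and calling substr once per character is likely to be slow next to the same loop in C.

```perl
use strict;
use warnings;

# Hypothetical helper: compare the characters one by one until they diverge.
sub common_prefix_len_scan {
    my ($s, $t) = @_;
    my $max = length($s) < length($t) ? length($s) : length($t);
    my $i   = 0;
    $i++ while $i < $max && substr($s, $i, 1) eq substr($t, $i, 1);
    return $i;
}

print common_prefix_len_scan('aaababbbababbbabababb',
                             'aaababbbabaaababbabba'), "\n";   # prints 11
```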
Re: String matching idea by Anonymous Monk on Sep 18, 2015 at 17:52 UTC
#!/usr/bin/perl
# http://perlmonks.org/?node_id=1142400

use strict;
use warnings;

# two strings
my $s1 = 'aaababbbababbbabababb';
my $s2 = 'aaababbbabaaababbabba';

# XOR gives \0 wherever the characters agree; match the leading run of \0
($s1 ^ $s2) =~ /^\0*/;
print $+[0], "\n";

# many strings (assumes no \n in strings)
my @many = qw(
  aaababbbababbbabababb
  aaababbbabaaababbabba
  aaababababaaababbabba
  aaababbbabaaaaabbabba
);
# $1 backtracks to the longest prefix that starts every line
join("\n", @many, '') =~ /^(.*).*\n(?:\1.*\n)*$/;
print length $1, "\n";

# also solves original problem :)
@many = qw( aaababbbababbbabababb aaababbbabaaababbabba );
join("\n", @many, '') =~ /^(.*).*\n(?:\1.*\n)*$/;
print length $1, "\n";

As far as "fastest way", see Benchmark.pm :)
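Following the pointer to Benchmark.pm, a comparison could be set up roughly like this. It is a sketch only: by_xor and by_scan are hypothetical wrappers around two of the approaches from this thread, and the timings will depend on the Perl build, the machine, and the string lengths, so no result is shown.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $s1 = 'aaababbbababbbabababb';
my $s2 = 'aaababbbabaaababbabba';

# Hypothetical wrapper: XOR the strings and measure the leading run of \0.
sub by_xor {
    my ($s, $t) = @_;
    ($s ^ $t) =~ /^\0*/;    # always matches, possibly with zero length
    return $+[0];           # end offset of the match = shared prefix length
}

# Hypothetical wrapper: compare character by character.
sub by_scan {
    my ($s, $t) = @_;
    my $max = length($s) < length($t) ? length($s) : length($t);
    my $i   = 0;
    $i++ while $i < $max && substr($s, $i, 1) eq substr($t, $i, 1);
    return $i;
}

# A negative count asks Benchmark to run each sub for at least
# that many CPU seconds, then print a comparison table.
cmpthese(-1, {
    xor_regex => sub { by_xor($s1, $s2) },
    char_scan => sub { by_scan($s1, $s2) },
});
```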