Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid

Challenge: Fast Common Substrings

by Limbic~Region (Chancellor)
on Apr 04, 2007 at 01:01 UTC ( #608174=perlquestion: print w/replies, xml ) Need Help??

Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

Tonight on #perl in freenode, a user asked "What is the fastest way to find the number of substrings two strings have in common?"

The user had working code but wanted to know if it could be made faster without relying on Inline::C. We asked a number of clarifying questions to flesh out the requirements:

  • Only the count is required, not the substrings
  • A substring that matches in multiple places counts only once
  • The substring length is user defined
  • Substrings, as defined by the requester, are contiguous. All substrings of 'ABC' are A, B, C, AB, BC, ABC
My idea was to generate an unpack template programmatically that would return all the substrings at once. It would then be a simple matter of returning the count of intersecting substrings. It seemed like this could be done without any explicit loops. My solution follows:

It seemed to me that the challenge would be much more interesting if the substring length were allowed to be a range so that's worth bonus points. What is the fastest solution you can come up with in pure perl?

Update: Added definition of substring as well as minor code change assigning hash slice to empty list (f00li5h++).

Update: 2007-04-04 08:25 EST - Thanks to everyone who has contributed so far. I will be posting a Benchmark after work so that folks who want to optimize for a given data set may.

Update: 2007-04-05 08:14 EST - Two days after scheduling to transfer my service to another provider, my ISP mysteriously starts having problems with my account - coincidence? I would appreciate it if someone could post a Benchmark assuming 5,000 pairs of strings ranging in length from 30 to 50 lowercase letters with a desired substring length ranging from 3 to 7.

Cheers - L~R

Replies are listed 'Best First'.
Re: Challenge: Fast Common Substrings
by blokhead (Monsignor) on Apr 04, 2007 at 02:04 UTC
    Here's just a naive solution using regexes. I have no idea if it's fast. On one hand, the regex engine is doing all the work; on the other hand, in the second code snippet, it's just brute-forcing the problem. The match_all_ways sub is taken from Re^3: Delimited Backtracking with Regex.
    { my @matches; my $push = qr/(?{ push @matches, $1 })/; sub match_all_ways { my ($string, $regex) = @_; @matches = (); $string =~ m/($regex)$push(?!)/; return @matches; } } sub common_substr { my ($str1, $str2, $len) = @_; my %substr = map { $_ => 1 } match_all_ways($str1 => qr/.{$len}/); $substr{$_} |= 2 for match_all_ways($str2 => qr/.{$len}/); return grep { $substr{$_} == 3 } keys %substr; } print "$_\n" for common_substr("ABCDEF", "ABDEFCBDEAB", 2); __END__ DE EF AB
    Returns all the common substrings, since it might as well.. In scalar context returns the number.

    Update: Or, if you want the regex engine to do all of the work, instead of computing the intersection of the two lists in perl:

    { my @matches; my $push = qr/(?{ push @matches, $1 })/; sub match_all_ways { my ($string, $regex) = @_; @matches = (); $string =~ m/$regex$push(?!)/; return @matches; } } sub common_substr { my ($str1, $str2, $len) = @_; my %subs; @subs{ match_all_ways("$str1\0$str2" => qr/(.{$len}).*\0.*\1/) } + = (); return keys %subs; }
    Note that the match_all_ways sub was changed slightly (to account for the different capturing). Disclaimer: If input strings contain a newline or null character, or if $len > 65536, it doesn't work.. but I think you get the idea.

    These solutions are both trivial to extend to a range of lengths.. just pass "n,m" as the $len argument.


Re: Challenge: Fast Common Substrings
by BrowserUk (Patriarch) on Apr 04, 2007 at 02:54 UTC

    Here's my best attempt so far:

    sub nCommonSubstrLenL { my( $haystack, $needle, $len ) = @_; ( $haystack, $needle ) = ( $needle, $haystack ) if length( $haystack ) < length( $needle ); # Added my $count = 0; my %possibles; for my $ni ( 0 .. length( $needle ) - $len ) { my $possible = substr( $needle, $ni, $len ); next if ++$possibles{ $possible } > 1; ++$count if 1+index $haystack, $possible; } return $count; }

    Update: A slightly faster reformulation. Updated again to work for lengths other than 2.

    sub nCommonSubstrLenL2 { my( $haystack, $needle, $len ) = @_; ( $haystack, $needle ) = ( $needle, $haystack ) if length( $haystack ) < length( $needle ); ## Added. # my $pattern = "A$len X" x int( length( $needle ) / $len ); my $pattern = (" A$len" . 'X' x ( $len-1 )) x (length( $needle ) - + $len +1); my $count = 0; my %possibles; for my $possible ( unpack $pattern, $needle ) { next if ++$possibles{ $possible } > 1; ++$count if 1+index $haystack, $possible; } return $count; }

    And buking for the bonus, a not well tested version that looks for all common substring equal or greater in length than the user specifed parameter:

    sub nCommonSubstrGreaterLenL { my( $haystack, $needle, $len ) = @_; my $count = 0; my %possibles; for my $l ( $len .. length( $needle ) ) { my $sofar = $count; for my $ni ( 0 .. length( $needle ) - $l ) { my $possible = substr( $needle, $ni, $l ); next if ++$possibles{ $possible } > 1; ## print( "$possible : $count" ), ++$count if 1+index $haystack, $possible; } last unless $count > $sofar; } return $count; } print 'Buk:', nCommonSubstrGreaterLenL 'ABCDEF','ABDEFCBDEAB', 2; __END__ AB : 0 DE : 1 EF : 2 DEF : 3 Buk:4

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Challenge: Fast Common Substrings
by thezip (Vicar) on Apr 04, 2007 at 05:04 UTC

    I throw my hat into the ring with my recursive implementation. I think it could perform reasonably well, since it is a divide-and-conquer type solution.

    I'm not sure how it will stand up to the hash-based solutions, and there might be some "correctness" issues...

    For the bonus, though, mine increments the count for *any* length of matching substrings.

    There are probably many opportunities for optimizations here... please offer criticism.

    #! perl -w use strict; my $hits = 0; sub common_substr { my($s1, $s2) = @_; print qq(s1 = $s1, s2 = $s2\n); ($s1, $s2) = ($s2, $s1) if length($s2) < length($s1); if ($s1 eq $s2) { $hits++; return if length($s1) == 1; } my %hash = map { $_ => 1 } split(//, $s1); my $arr = []; for my $s (split(//, $s2)) { push(@$arr, $s) if ! exists($hash{$s}); } my $splitters = join('|', @$arr); for my $s (split(/$splitters/, $s2)) { common_substr($s, $s1); } } common_substr("abcef", "abcdef"); print "hits = $hits\n";

    Where do you want *them* to go today?

      OK, I concede to the Suffix Tree solution presented by lima1 ++.

      I suspect the best run-time order I could muster is O(nlogn), and worst O(n²).

      It was still a fun diversion nonetheless... :-)

      Where do you want *them* to go today?

        I wouldn't give up yet. You need to see an implementation and how it benchmarks.

        Implementing Suffix Trees in pure perl is messy, memory expensive (hugely so in my attempts), and rarely works out more efficient. Don't forget that it's O(n) to build as well as O(m) to use. If you are only using it once, that snips into the benefits.

        For short strings a linear search (at C speed) and no build time wins easily. For longer strings, the indexing a suffix tree provides is a winner, but it gets set back awfully by the build time and memory management. If you only use the tree the once (as for this challenge), there little or no win.

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Challenge: Fast Common Substrings
by lima1 (Curate) on Apr 04, 2007 at 15:50 UTC
    Just for the sake of completeness: A fast and elegant algorithm for this is a tricky use of suffix trees. One concatenates the two strings of length n and m, say abcdef%efgab$. It is possible to construct a suffix tree of this string in O(n+m) (Ukkonen algorithm). To find the common substrings, one has then to search for nodes that have exactly two (or the number of strings) leafs belonging to the different words. The resulting suffix tree for "abcdef" and "efgab":
    | |(3:cdef%efgab$)|leaf |(1:ab)| | |(13:$)|leaf tree:| | |(3:cdef%efgab$)|leaf |(2:b)| | |(13:$)|leaf | |(3:cdef%efgab$)|leaf | |(4:def%efgab$)|leaf | | |(7:%efgab$)|leaf |(5:ef)| | |(10:gab$)|leaf | | |(7:%efgab$)|leaf |(6:f)| | |(10:gab$)|leaf | |(7:%efgab$)|leaf | |(10:gab$)|leaf |
    So "ab" has two leafs in the different words (position <= 7 for leaf 1 and position > 7 for leaf 2). So have 'b', 'ef' and 'f'.

    Update: Just found some perl code with google ... on perlmonks ;) Re: finding longest common substring

      ++ Wow, thank you for introducing me to suffix trees. What an interesting concept, and how refreshing to see a linear-time algorithm for constructing such a creature. I see you've used the javascript applet at this page, which others may want to check out.

      However, I'd like to slightly revise the algorithm you outlined. Consider the following example:

      string = ababc%bc$ | |(3:abc%bc$)|leaf |(1:ab)| | |(5:c%bc$)|leaf tree:| | |(3:abc%bc$)|leaf |(2:b)| | | |(6:%bc$)|leaf | |(5:c)| | | |(9:$)|leaf | | |(6:%bc$)|leaf |(5:c)| | |(9:$)|leaf | |(6:%bc$)|leaf | |(9:$)|leaf
      "ab" appears twice in the first string, and so it gives a node with two leaves. The actual condition you should check is whether a node has one leaf containing the % separator and another leaf without the % symbol.


        The page you link to mentions being able to build them in O(n) but then only really describes how to go from a suffix tree for string $x to one for string $x.$c (1==length$c) in O(length $x). Using that algorithm would require O(N*N) to build the suffix tree for a string of length N.

        So I'm not sure I believe the O(N) claim for building the whole suffix tree based on that page.

        - tye        

        Or even easier: check the positions of the substrings (<=7 and > 7 in my example).
Re: Challenge: Fast Common Substrings
by eric256 (Parson) on Apr 05, 2007 at 00:03 UTC

    I dunno about speed, but its the only version i can actualy understand so far ;) and who knows, it might not be as slow as you think...probably will be..but hey! I think the second loop through s2 could probably have some short circuits added where it knows the sub string it is on doesn't occur in the first string..but it's time to go home!

    use strict; use warnings; use Data::Dumper; sub common_sub { my ($s1,$s2,$len) = @_; my $len_s1 = length($s1); my $len_s2 = length($s2); my $match_s1 = {}; my $match_s2 = {}; for my $start (0..length($s1)-1) { for my $l (1..$len) { next if $start+$l > $len_s1; $match_s1->{substr($s1, $start, $l)} ||= 1; } } for my $start (0..length($s2)-1) { for my $l (1..$len) { next if $start+$l > $len_s2; $match_s2->{substr($s2, $start, $l)} ||= 1; } } $match_s1->{$_}++ for keys %$match_s2; return grep { $match_s1->{$_} == 2 } keys %$match_s1; } print join(",", common_sub("ABCD", "BCD",2));

    Eric Hodges
Re: Challenge: Fast Common Substrings
by BrowserUk (Patriarch) on Apr 05, 2007 at 15:23 UTC

    Okay here is a benchmark as requested. I doubt it will satisfy everyone.

    Some comments on the benchmark and results obtained.

    1. Randomly generated strings of 30 .. 50 lower case characters rarely if ever produce common substrings of 3 characters, never mind longer.

      To this end I used 'A'..'D', which produces a few at 3, 4, & 5 characters long, but never longer that I saw.

    2. Even adjacent pairings of 5000 strings takes a very long time to run, never mind a full combinatorial pairing.

      I used 500 strings instead. It's a command line parameter, if you've the time and inclination to run 5000, go ahead. I doubt it will make any huge difference to the outcome.

      theZip's (TZ) code is horrible to benchmark as it relies upon a global var. Ie. The sub does not return the result.

      I had a half hearted attempt to address this using a helper sub closing over the global, but the result are wildly inaccurate. Possibly? the fault of my adaption. Sorry thezip.

      However, it seems to be far and away the slowest algorithm anyway.

    3. eric256's (eric) code produces numbers that seem not to relate to the inputs as far a I can discern.

      Maybe I screwed up, but I don't think so. It's also the second slowest algorithm.

    4. Mine (Buk), blokhead's (BH1/BH2), moron's (MN), and Limbic~Region's (LR) code all produce the same counts (although Limbic~Region's has been seen to be one shy of the others on a few occasions).

      The level of concurrence between these disparate algorithms and implementations is taken to indicate that they are producing the correct results. I have not manually verified them.

    Results (500 random strings of 'A'..'D', tested each against the next for common substrings 3-7 characters

    The benchmark code. CLI parameters are: -N=nn numbers of strings to generate; -LENGTH=mm: length of common substrings to look for; Interesting challenge Limbic~Region, thanks.

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

        /me benchmarks agian with his in there as a matter of pride.

        C:\Perl\test> -N=1000 -LENGTH=2 Rate BH2 BH1 MN Eric L~R Buk BH2 3.37/s -- -20% -53% -70% -78% -87% BH1 4.21/s 25% -- -42% -62% -72% -84% MN 7.22/s 114% 71% -- -35% -52% -73% Eric 11.1/s 231% 164% 54% -- -26% -58% L~R 15.1/s 347% 258% 109% 35% -- -44% Buk 26.7/s 694% 534% 270% 140% 77% -- ---- BH2 => 14 L~R => 12 Eric => 14 Buk => 14 BH1 => 14 MN => 14 C:\Perl\test> -N=1000 -LENGTH=3 Rate BH1 MN BH2 Eric L~R Buk BH1 3.88/s -- -23% -39% -46% -73% -80% MN 5.05/s 30% -- -21% -30% -65% -74% BH2 6.40/s 65% 27% -- -11% -56% -67% Eric 7.21/s 86% 43% 13% -- -50% -63% L~R 14.4/s 272% 185% 125% 100% -- -26% Buk 19.4/s 400% 284% 203% 169% 35% -- ---- BH2 => 9 L~R => 9 Eric => 9 Buk => 9 BH1 => 9 MN => 9 C:\Perl\test> -N=1000 -LENGTH=4 Rate BH1 MN Eric BH2 L~R Buk BH1 3.56/s -- -19% -43% -57% -74% -79% MN 4.39/s 23% -- -30% -47% -68% -74% Eric 6.22/s 75% 42% -- -25% -54% -63% BH2 8.35/s 135% 90% 34% -- -38% -51% L~R 13.5/s 281% 209% 118% 62% -- -20% Buk 16.9/s 375% 285% 171% 102% 25% -- ---- BH2 => 4 L~R => 4 Eric => 4 Buk => 4 BH1 => 4 MN => 4 C:\Perl\test> -N=1000 -LENGTH=5 Rate BH1 MN Eric BH2 L~R Buk BH1 3.82/s -- -14% -34% -59% -72% -78% MN 4.44/s 16% -- -24% -52% -67% -74% Eric 5.82/s 52% 31% -- -37% -57% -66% BH2 9.27/s 143% 109% 59% -- -32% -46% L~R 13.6/s 255% 205% 133% 46% -- -21% Buk 17.1/s 348% 285% 194% 85% 26% -- ---- BH2 => 0 L~R => 0 Eric => 0 Buk => 0 BH1 => 0 MN => 0 C:\Perl\test> -N=1000 -LENGTH=6 (warning: too few iterations for a reliable count) Rate BH1 MN Eric BH2 L~R Buk BH1 3.56/s -- -21% -38% -64% -73% -80% MN 4.50/s 27% -- -21% -54% -66% -74% Eric 5.73/s 61% 27% -- -42% -57% -67% BH2 9.84/s 177% 119% 72% -- -25% -43% L~R 13.2/s 271% 193% 130% 34% -- -24% Buk 17.4/s 389% 286% 203% 77% 32% -- ---- BH2 => 1 L~R => 1 Eric => 1 Buk => 1 BH1 => 1 MN => 1

        Eric Hodges


      I misread the definition of substring to mean all substrings of that length or lower...quick easy fix and it starts placing decent.

      sub common_sub { my ($s1,$s2,$len) = @_; my $len_s1 = length($s1); my $len_s2 = length($s2); my $match_s1 = {}; my $match_s2 = {}; for my $start (0..length($s1)-1) { #for my $l (1..$len) { next if $start+$len > $len_s1; $match_s1->{substr($s1, $start, $len)} ||= 1; #} } for my $start (0..length($s2)-1) { #for my $l (1..$len) { next if $start+$len > $len_s2; $match_s2->{substr($s2, $start, $len)} ||= 1; #} } $match_s1->{$_}++ for keys %$match_s2; return grep { $match_s1->{$_} == 2 } keys %$match_s1; }

      Eric Hodges
Re: Challenge: Fast Common Substrings
by Moron (Curate) on Apr 04, 2007 at 10:01 UTC
    # more update: this is the tested version now!! sub CountSubstrings { # $string1, $string2, $substr_length my $l = pop @_; my %found = (); my %match = (); my $first = 2; my $string1 = $_[0]; for my $string ( @_ ) { $first --; my $ls = length( $string ); my $limit = $ls - $l + 1; for ( my $i = 0; $i < $limit; $i++ ) { my $sbstr = substr( $string, $i, $l ); $first or defined ( $match{$sbstr} ) && next(); $found{ $sbstr }{ $string } ||= 1; $first and next;; defined ( $found{ $sbstr }{ $string1 } ) and $match{ $sbstr } = 1; } } Count( keys %match ); } sub Count { 1+$#_ };
    Description of alggorithm: Store all substrings of string 1 of the given length in subhash $found{ $string1 }; do the same for string 2 but in addition lookup in the first hash and record any matches in the %match hash. The suggested opportunities for optimisation include (a) only need to check matches for the substring iteration of the second string. (b) only need to build the substring for string 2 if there is no match found in the match hash.


    Free your mind

      # untested

      True. It doesn't even compile. And when you've corrected the numerous errors, it doesn't work. And it's hard to see how you thought it would. Care to enlighten? (And did you notice the word "fast" in the title?)

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        I think many people will see this is only an algorithm outline and will take the word "untested" as an approriate hint in that regard. I am multitasking at the moment.

        re "enlighten" - I think under the circumstances the most appropriate enlightment I can offer is to improve your lateral thinking In particular I recommend Six Thinking Hats. You would benefit from the insight that you tend to be stuck wearing a black hat (negative logic) and need to learn how to take it off at will and replace it with other forms of thinking when needed. If you need evidence - In your post you say on the one hand you find it difficult to see how I expect it to work and yet you go on to imply an ability to assess its performance when corrected. That's like saying the performance of anything in a category is X without realising the scope you have already accepted for that category, making it impossible to make any sensible such judgement. Conversely, all I am offering is a means to use hashes to solve the problem and I sincerely expect this to perform well when corrected given the pure simplicity of the algorithm and the opportunities it uses to cut out unnecessary iteration. I think you will find that many beginners out there will find it easy enough to correct this code and get a fast-performing result. So why not you, eh? You're clearly not nearly a beginner are you. So what else can it be ... ?


        Free your mind

          A reply falls below the community's threshold of quality. You may see it by logging in.

Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://608174]
Approved by planetscape
Front-paged by planetscape
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chanting in the Monastery: (2)
As of 2023-09-27 05:07 GMT
Find Nodes?
    Voting Booth?

    No recent polls found