Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:
The user had working code but wanted to know if it could be made faster without relying on Inline::C. We asked a number of clarifying questions to flesh out the requirements:
It seemed to me that the challenge would be much more interesting if the substring length were allowed to be a range so that's worth bonus points. What is the fastest solution you can come up with in pure perl?
Update: Added definition of substring as well as minor code change assigning hash slice to empty list (f00li5h++).
Update: 2007-04-04 08:25 EST - Thanks to everyone who has contributed so far. I will be posting a Benchmark after work so that folks who want to optimize for a given data set may.
Update: 2007-04-05 08:14 EST - Two days after scheduling to transfer my service to another provider, my ISP mysteriously starts having problems with my account - coincidence? I would appreciate it if someone could post a Benchmark assuming 5,000 pairs of strings ranging in length from 30 to 50 lowercase letters with a desired substring length ranging from 3 to 7.
Cheers - L~R
|
---|
Replies are listed 'Best First'. | |
---|---|
Re: Challenge: Fast Common Substrings
by blokhead (Monsignor) on Apr 04, 2007 at 02:04 UTC | |
Returns all the common substrings, since it might as well.. In scalar context returns the number. Update: Or, if you want the regex engine to do all of the work, instead of computing the intersection of the two lists in perl: Note that the match_all_ways sub was changed slightly (to account for the different capturing). Disclaimer: If input strings contain a newline or null character, or if $len > 65536, it doesn't work.. but I think you get the idea. These solutions are both trivial to extend to a range of lengths.. just pass "n,m" as the $len argument. blokhead | [reply] [d/l] [select] |
Re: Challenge: Fast Common Substrings
by BrowserUk (Patriarch) on Apr 04, 2007 at 02:54 UTC | |
Here's my best attempt so far:
Update: A slightly faster reformulation. Updated again to work for lengths other than 2.
And buking for the bonus, a not well tested version that looks for all common substring equal or greater in length than the user specifed parameter:
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] [select] |
Re: Challenge: Fast Common Substrings
by thezip (Vicar) on Apr 04, 2007 at 05:04 UTC | |
I throw my hat into the ring with my recursive implementation. I think it could perform reasonably well, since it is a divide-and-conquer type solution. I'm not sure how it will stand up to the hash-based solutions, and there might be some "correctness" issues... For the bonus, though, mine increments the count for *any* length of matching substrings. There are probably many opportunities for optimizations here... please offer criticism.
Where do you want *them* to go today?
| [reply] [d/l] |
by thezip (Vicar) on Apr 04, 2007 at 20:45 UTC | |
OK, I concede to the Suffix Tree solution presented by lima1 ++. I suspect the best run-time order I could muster is O(nlogn), and worst O(n²). It was still a fun diversion nonetheless... :-)
Where do you want *them* to go today?
| [reply] |
by BrowserUk (Patriarch) on Apr 04, 2007 at 21:54 UTC | |
I wouldn't give up yet. You need to see an implementation and how it benchmarks. Implementing Suffix Trees in pure perl is messy, memory expensive (hugely so in my attempts), and rarely works out more efficient. Don't forget that it's O(n) to build as well as O(m) to use. If you are only using it once, that snips into the benefits. For short strings a linear search (at C speed) and no build time wins easily. For longer strings, the indexing a suffix tree provides is a winner, but it gets set back awfully by the build time and memory management. If you only use the tree the once (as for this challenge), there little or no win. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
Re: Challenge: Fast Common Substrings
by lima1 (Curate) on Apr 04, 2007 at 15:50 UTC | |
So "ab" has two leafs in the different words (position <= 7 for leaf 1 and position > 7 for leaf 2). So have 'b', 'ef' and 'f'. http://en.wikipedia.org/wiki/Longest_common_substring_problem Update: Just found some perl code with google ... on perlmonks ;) Re: finding longest common substring | [reply] [d/l] |
by blokhead (Monsignor) on Apr 04, 2007 at 16:02 UTC | |
However, I'd like to slightly revise the algorithm you outlined. Consider the following example: "ab" appears twice in the first string, and so it gives a node with two leaves. The actual condition you should check is whether a node has one leaf containing the % separator and another leaf without the % symbol. blokhead | [reply] [d/l] |
by tye (Sage) on Apr 04, 2007 at 21:53 UTC | |
The page you link to mentions being able to build them in O(n) but then only really describes how to go from a suffix tree for string $x to one for string $x.$c (1==length$c) in O(length $x). Using that algorithm would require O(N*N) to build the suffix tree for a string of length N. So I'm not sure I believe the O(N) claim for building the whole suffix tree based on that page. - tye | [reply] |
by lima1 (Curate) on Apr 04, 2007 at 22:12 UTC | |
by lima1 (Curate) on Apr 04, 2007 at 16:25 UTC | |
| [reply] |
Re: Challenge: Fast Common Substrings
by eric256 (Parson) on Apr 05, 2007 at 00:03 UTC | |
I dunno about speed, but its the only version i can actualy understand so far ;) and who knows, it might not be as slow as you think...probably will be..but hey! I think the second loop through s2 could probably have some short circuits added where it knows the sub string it is on doesn't occur in the first string..but it's time to go home!
___________ Eric Hodges | [reply] [d/l] |
Re: Challenge: Fast Common Substrings
by BrowserUk (Patriarch) on Apr 05, 2007 at 15:23 UTC | |
Okay here is a benchmark as requested. I doubt it will satisfy everyone. Some comments on the benchmark and results obtained. Results (500 random strings of 'A'..'D', tested each against the next for common substrings 3-7 characters Read more... (4 kB)
The benchmark code. CLI parameters are: -N=nn numbers of strings to generate; -LENGTH=mm: length of common substrings to look for; Interesting challenge Limbic~Region, thanks.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] [select] |
by BrowserUk (Patriarch) on Apr 05, 2007 at 15:34 UTC | |
The same benchmark as above but excluding the two slowest/erroneous algorithms, allows me to run 1000 strings looking for common substrings of lengths 2 .. 6 and get more digestable results: Read more... (2 kB)
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
by eric256 (Parson) on Apr 06, 2007 at 20:14 UTC | |
/me benchmarks agian with his in there as a matter of pride.
___________ Eric Hodges | [reply] [d/l] |
by eric256 (Parson) on Apr 06, 2007 at 20:07 UTC | |
Hey I misread the definition of substring to mean all substrings of that length or lower...quick easy fix and it starts placing decent.
___________ Eric Hodges | [reply] [d/l] |
Re: Challenge: Fast Common Substrings
by Moron (Curate) on Apr 04, 2007 at 10:01 UTC | |
Description of alggorithm: Store all substrings of string 1 of the given length in subhash $found{ $string1 }; do the same for string 2 but in addition lookup in the first hash and record any matches in the %match hash. The suggested opportunities for optimisation include (a) only need to check matches for the substring iteration of the second string. (b) only need to build the substring for string 2 if there is no match found in the match hash. -M Free your mind | [reply] [d/l] |
by BrowserUk (Patriarch) on Apr 04, 2007 at 10:25 UTC | |
# untested True. It doesn't even compile. And when you've corrected the numerous errors, it doesn't work. And it's hard to see how you thought it would. Care to enlighten? (And did you notice the word "fast" in the title?) Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
by Moron (Curate) on Apr 04, 2007 at 12:12 UTC | |
re "enlighten" - I think under the circumstances the most appropriate enlightment I can offer is to improve your lateral thinking In particular I recommend Six Thinking Hats. You would benefit from the insight that you tend to be stuck wearing a black hat (negative logic) and need to learn how to take it off at will and replace it with other forms of thinking when needed. If you need evidence - In your post you say on the one hand you find it difficult to see how I expect it to work and yet you go on to imply an ability to assess its performance when corrected. That's like saying the performance of anything in a category is X without realising the scope you have already accepted for that category, making it impossible to make any sensible such judgement. Conversely, all I am offering is a means to use hashes to solve the problem and I sincerely expect this to perform well when corrected given the pure simplicity of the algorithm and the opportunities it uses to cut out unnecessary iteration. I think you will find that many beginners out there will find it easy enough to correct this code and get a fast-performing result. So why not you, eh? You're clearly not nearly a beginner are you. So what else can it be ... ? -M Free your mind | [reply] |
by BrowserUk (Patriarch) on Apr 04, 2007 at 12:27 UTC | |
|