An efficient way to gather a common portion of several strings' beginnings

igoryonya has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: An efficient way to gather a common portion of several strings' beginnings by BrowserUk (Patriarch) on Nov 15, 2015 at 08:16 UTC
This is O(n) rather than the O(n²) of your solution: #! perl -slw use strict; my @strings = ( 'string that I need to gather the common base from: number 1 and some +other junk in it', 'string that I need to gather the common base from: number 2 and some +other junk in it', 'string that I need to gather the common base number 4 and some other +junk in it', 'string that I need to gather the common base from: number 3 and some +other junk in it', ); my( $mask ) = ( $strings[ 0 ] ^ $strings[ 1 ] ) =~ m[(^\0+)]; my $common = substr $strings[ 0 ], 0, length $mask; for my $i ( 2 .. $#strings ) { if( substr( $strings[ $i ], 0, length $common ) ne $common ) { ( $mask ) = ( $strings[ 0 ] ^ $strings[ $i ] ) =~ m[(^\0+)]; $common = substr $strings[ 0 ], 0, length $mask; } } print "'$common'"; __END__ C:\test>1147616 'string that I need to gather the common base ' [download] With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :) In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l]
Re^2: An efficient way to gather a common portion of several strings' beginnings by AnomalousMonk (Archbishop) on Nov 15, 2015 at 17:01 UTC
The `m[(^\0+)]` match results in an undefined value for `$mask` when there is no common base substring at all, which then hiccups a "Use of uninitialized..." warning in the substr expression (if warnings are enabled). Happily, this is easily fixed by using `m[(^\0*)]` instead! Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]
Re^3: An efficient way to gather a common portion of several strings' beginnings by BrowserUk (Patriarch) on Nov 15, 2015 at 17:38 UTC
Good catch and nice solution; but if the OP is expecting there to be a common prefix to his lines, then that warning might be a godsend in the event of a dataset where that is not the case. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :) In the absence of evidence, opinion is indistinguishable from prejudice.	[reply]
Re^2: An efficient way to gather a common portion of several strings' beginnings by AnomalousMonk (Archbishop) on Nov 15, 2015 at 18:16 UTC
This solution fails for the null-infested `(qq{\000\000\000}, qq{\000\000}, qq{\000})` list of strings (as does GrandFather's, but GrandFather (update: explicitly) assumes nulls will not be present in any strings). The same list reversed produces the proper result. Some other solutions seem to accept nulls happily. Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]
Re^3: An efficient way to gather a common portion of several strings' beginnings by BrowserUk (Patriarch) on Nov 15, 2015 at 18:28 UTC
This solution fails for the null-infested (qq{\000\000\000}, qq{\000\000}, qq{\000}) list of strings And what you got from the OPs sample data is that his data is infested with nulls? but GrandFather assumes nulls will not be present in any strings And mine doesn't? Oh. I see. He states that he's assuming it rather than leaving the bloody obvious to be bloody obvious. (Whatever did I do to rattle your cage recently?) With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :) In the absence of evidence, opinion is indistinguishable from prejudice.	[reply]
Re^4: An efficient way to gather a common portion of several strings' beginnings by AnomalousMonk (Archbishop) on Nov 15, 2015 at 21:50 UTC
Re^2: An efficient way to gather a common portion of several strings' beginnings by Anonymous Monk on Nov 15, 2015 at 18:45 UTC
Speaking of algorithmic efficiency. Consider the case where `$string[17] eq "x"` and preceding values all span kilobytes. Clearly, there's some room for improvement.	[reply] [d/l]
Re^3: An efficient way to gather a common portion of several strings' beginnings by BrowserUk (Patriarch) on Nov 15, 2015 at 19:03 UTC
Speaking of algorithmic efficiency. Consider the case where $string17 eq "x" and preceding values all span kilobytes. Clearly, there's some room for improvement. Hm. You could sort the strings, but sorting is O(n log n) instead of O(n). Still think so? With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :) In the absence of evidence, opinion is indistinguishable from prejudice.	[reply]
Re^2: An efficient way to gather a common portion of several strings' beginnings by lonewolf28 (Beadle) on Nov 17, 2015 at 00:09 UTC
++ Superb solution /u/BrowserUk. It took me some time to figure out. Cheers.	[reply]
Re: An efficient way to gather a common portion of several strings' beginnings by GrandFather (Saint) on Nov 15, 2015 at 11:34 UTC
Assuming none of your strings contain nulls: use strict; use warnings; my @strings = ( 'string that I need to gather the common base from: number 1 and s +ome other junk in it', 'string that I need to gather the common base from: number 2 and s +ome other junk in it', 'string that I need to gather the common base number 4 and some ot +her junk in it', 'string that I need to gather the common base from: number 3 and s +ome other junk in it', ); my $common = $strings[0]; for my $str (@strings[1 .. $#strings]) { ($common ^ $str) =~ m/^\0/; $common = substr $str, 0, $+[0] if $+[0] < length $common; } print "'$common'"; [download] Prints: `'string that I need to gather the common base '` [download] The xor operator ('^') combines the strings byte by byte and generates a null for each identical byte pair. @+ contains the offsets of the ends of matches. In this case the entire match is just the ticket so we use the first entry, 0, which is effectively the length of the common base. <Update:* Fixed match issue pointed out by AnomalousMonk. Premature optimization is the root of all job security	[reply] [d/l] [select]
Re^2: An efficient way to gather a common portion of several strings' beginnings by AnomalousMonk (Archbishop) on Nov 15, 2015 at 18:26 UTC
`($common ^ $str) =~ m/^\0+/;` Please see reply above regarding `m/^\0+/` versus `m/^\0*/` match. Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]
Re: An efficient way to gather a common portion of several strings' beginnings by atcroft (Abbot) on Nov 15, 2015 at 08:22 UTC
I would suggest starting by ordering your array by string length, shortest to longest. Then, take element 0 as your first approximation at the shortest substring and compare it to the same length section (via substr()) to the next string. If they do not match, reduce the test sequence until you arrive at a match, or an empty string. Repeat until you have examined all strings, or have an empty approximation string. Untested code example: `my @string = sort { length $a <=> length $b } ( qw/ quux asdfasdfasdf asdfasdf asdfzxcv as / ); my $common = shift @string; while ( my $teststr = shift @string and length $common ) { if ( $common eq substr( $teststr, 0, length $common ) ) { next; } my $flag = length $common; while ( $flag ) { if ( $common eq substr( $teststr, 0, $flag ) ) { $flag = 0; } else { $flag--; $common = substr( $common, 0, $flag ); } } } # print $common;` [download] Hope that helps. Update: 2015-11-15 Fixed errors in code sample. (My thanks to oiskuu for pointing their existence!)	[reply] [d/l]
Re: An efficient way to gather a common portion of several strings' beginnings by hippo (Archbishop) on Nov 15, 2015 at 13:11 UTC
If you first sort the array lexically all you need to do is compare the first and last entries. I've not benched it but would expect this to be most efficient for any large array.	[reply]
Re: An efficient way to gather a common portion of several strings' beginnings by LanX (Saint) on Nov 15, 2015 at 20:11 UTC
TIMTOWTDI personally I much prefer hippos approach with lexical sorting! FWIW here a solution with a regex searching through a concatenation of all strings for the longest repeated pattern: `use strict; use warnings; my @strings = ( 'string that I need to gather the common base from: number 1 and s +ome other junk in it', 'string that I need to gather the common base from: number 2 and s +ome other junk in it', 'string that I need to gather the common base from: number 3 and s +ome other junk in it', ); my $sep = "\0"; my $nosep = "[^$sep]"; my $all = join $sep, @strings; $all =~ /^ ($nosep) $nosep? ( $sep \1 $nosep*? )+ $/x; print "solution: '$1'";` [download] OUTPUT: `solution: 'string that I need to gather the common base from: number '` Please note that you need to take care that the separator (here \0 ) is not part of any string. You could easily check by counting $sep in $all. Cheers Rolf _{(addicted to the Perl Programming Language and ☆☆☆☆ :) Je suis Charlie!}	[reply] [d/l] [select]
Re^2: An efficient way to gather a common portion of several strings' beginnings by oiskuu (Hermit) on Nov 15, 2015 at 21:52 UTC
Lexical sort is `nlog(n)m` (character comparisons) vs `nm` of the basic loop. It is certainly not optimal for arbitrary inputs. Update.* Some benchmarking shows either of the two following versions ought to perform adequately, in practice. I'll leave it to the reader's discretion... `# however, these assume strings don't contain \0 sub lcp_v4 { my ($e, $s) = ("", @_); $e \|= $_ ^ $s for @_; $e =~ m/^\0/; substr($s, 0, $+[0]); } sub lcp_v5 { my ($a, $b) = (sort @_)[0,-1]; ($a ^ $b) =~ m/^\0/; substr($a, 0, $+[0]); } use List::Util qw(minstr maxstr); sub lcp_v6 { my ($a, $b) = (&minstr, &maxstr); ($a ^ $b) =~ m/^\0/; substr($a, 0, $+[0]); }` [download] Update 2.* Sometimes one is the most oblivious to the most obvious. Added the third version lcp_v6.	[reply] [d/l] [select]
Re^3: An efficient way to gather a common portion of several strings' beginnings by LanX (Saint) on Nov 15, 2015 at 22:02 UTC
~~True~~ Maybe true but A) simple is beautiful :) B) arbitrary input where performance matters ? This won't have any meaningful substrings other than "" Cheers Rolf _{(addicted to the Perl Programming Language and ☆☆☆☆ :) Je suis Charlie!}	[reply]
Re^3: An efficient way to gather a common portion of several strings' beginnings by LanX (Saint) on Nov 17, 2015 at 20:57 UTC
Maybe of interest: I tried to beat sort, where much of the result is useless, cause we only need the first and last element and nothing in between. The results are not too spectacular, though its requires less memory: use warnings; use strict; use Time::HiRes qw/time/; my $common = join "","a".."z"; my @x = map { $common . rand 1 } 1..1e6; my (@a,$a,$b,$start); print "\n--- with full sort\n"; @a=@x; $start=time(); my @b= sort @a; print $b[0],"\n",$b[-1],"\n"; print time -$start,"\n"; print "\n--- with triple sort\n"; @a=@x; $start=time(); $a= shift @a; $b= shift @a; ($a,undef,$b) = sort($a,$_,$b) for @a; print $a,"\n",$b,"\n"; print time -$start,"\n"; print "\n--- with assignment\n"; @a=@x; $start=time(); $a= shift @a; $b= shift @a; for my $x (@a) { $a = ($x,$x,$a)[$a cmp $x]; #next if $a eq $x; $b = ($x,$b,$x)[$b cmp $x]; } print $a,"\n",$b,"\n"; print time -$start,"\n"; print "\n--- with goto\n"; @a=@x; $start=time(); $a= shift @a; $b= shift @a; for my $x (@a) { goto ("NEXT", "NEWMIN", "MAYBEMAX")[$a cmp $x]; NEWMIN: $a=$x; next; MAYBEMAX: goto ("NEXT", "NEXT", "NEWMAX" )[$b cmp $x]; NEWMAX: $b=$x; NEXT: } print $a,"\n",$b,"\n"; print time -$start,"\n"; [download] output: --- with full sort abcdefghijklmnopqrstuvwxyz0.000100396248079448 abcdefghijklmnopqrstuvwxyz9.99150346814304e-06 7.66956496238708 --- with triple sort abcdefghijklmnopqrstuvwxyz0.000100396248079448 abcdefghijklmnopqrstuvwxyz9.99150346814304e-06 5.88848209381104 --- with assignment abcdefghijklmnopqrstuvwxyz0.000100396248079448 abcdefghijklmnopqrstuvwxyz9.99150346814304e-06 1.2906858921051 --- with goto abcdefghijklmnopqrstuvwxyz0.000100396248079448 abcdefghijklmnopqrstuvwxyz9.99150346814304e-06 2.99558115005493 [download] Cheers Rolf _{(addicted to the Perl Programming Language and ☆☆☆☆ :) Je suis Charlie!} update either I'm working too hard or Alzheimer is knocking at the door. I completely forgot about `le` and `ge` and spend too much effort into emulation `print "\n--- with le/ge\n"; @a=@x; $start=time(); $a= shift @a; $b= shift @a; for my $x (@a) { $a = $x if $a ge $x; $b = $x if $b le $x; } print $a,"\n",$b,"\n"; print time -$start,"\n"; __END__ --- with le/ge abcdefghijklmnopqrstuvwxyz0.000100420329498974 abcdefghijklmnopqrstuvwxyz9.95282924378671e-05 0.773550033569336` [download]	[reply] [d/l] [select]
Re^3: An efficient way to gather a common portion of several strings' beginnings by LanX (Saint) on Nov 17, 2015 at 00:36 UTC
> Update. Some benchmarking ... your version is of course faster `DB<129> push @a, join "",a..z,rand(1) for 1..1e6 => "" DB<130> $start=time;print lcp_v4(@a);time -$start => 1.25800704956055 abcdefghijklmnopqrstuvwxyz DB<131> $start=time;print lcp_v5(@a);time -$start => 6.93630909919739 abcdefghijklmnopqrstuvwxyz` [download] > # however, these assume strings don't end with \0 See thats the benefit of a simple algorithm, sort doesn't care about \0, one could easily replace the comparison of 2 lines with something stable without notable impact. Cheers Rolf _{(addicted to the Perl Programming Language and ☆☆☆☆ :) Je suis Charlie!}	[reply] [d/l]
Re^3: An efficient way to gather a common portion of several strings' beginnings by LanX (Saint) on Nov 17, 2015 at 22:48 UTC
> Update 2. Sometimes one is the most oblivious to the most obvious. Added the third version lcp_v6. I could call using modules cheating, but a) it's core b) you have to admit that hippo's sort approach is sexy! `--- with le/ge abcdefghijklmnopqrstuvwxyz0.000100071512918021 abcdefghijklmnopqrstuvwxyz9.83499014282074e-05 0.786839962005615 --- with minstr/maxstr abcdefghijklmnopqrstuvwxyz0.000100071512918021 abcdefghijklmnopqrstuvwxyz9.83499014282074e-05 0.23346209526062` [download] Though your final comparison fails with some edge cases :-P Cheers Rolf _{(addicted to the Perl Programming Language and ☆☆☆☆ :) Je suis Charlie!}	[reply] [d/l]
Re: An efficient way to gather a common portion of several strings' beginnings by AnomalousMonk (Archbishop) on Nov 15, 2015 at 18:55 UTC
Here's my solution (probably not the best, but at least O(n)), along with testing of all the others except atcroft's, which has many failures. (These failures may have a simple fix, but I've no time right now.) Read more... (18 kB) Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]
Re^2: An efficient way to gather a common portion of several strings' beginnings by james28909 (Deacon) on Nov 15, 2015 at 20:49 UTC
You left me out of that test, I feel so unappreciated ;( haha :P	[reply]
Re: An efficient way to gather a common portion of several strings' beginnings by james28909 (Deacon) on Nov 15, 2015 at 16:02 UTC
Using capture groups, only print (or push or add to hash) values that match. use strict; use warnings; my @strings = ( 'string that I need to gather the common base from: number 1 and some +other junk in it', 'string that I need to gather the common base from: number 2 and some +other junk in it', 'string that I need to gather the common base from: number 3 and some +other junk in it', 'string that doesnt meet the proper specifications: number 4 and some +other junk in it', 'another string that will fail to match: because it doesnt match DUH!' +, 'this string will fail as well' ); for(@strings){ print $1 . $2 . "\n" if (/^(string that I need to gather the commo +n base from:\s)(.*)/); } [download]	[reply] [d/l]
Re: An efficient way to gather a common portion of several strings' beginnings by oiskuu (Hermit) on Nov 15, 2015 at 20:22 UTC
Here's a version using List::Util. Pretty straight-forward. `#! /usr/bin/perl use strict; use warnings; use List::Util q(first); sub csp { my ($n, $s) = (-1, shift // return); while (++$n < length($s)) { my $c = substr($s,$n,1); last if first { substr($_,$n,1) ne $c } @_; } substr($s,0,$n); } print csp(qw(ahhaaaaa aha ahem));` [download] But sorting the strings by length could still prove beneficial, esp. if the data is already thus structured.	[reply] [d/l]

update