Re: An efficient way to gather a common portion of several strings' beginnings
by BrowserUk (Patriarch) on Nov 15, 2015 at 08:16 UTC
|
#! perl -slw
use strict;
my @strings = (
'string that I need to gather the common base from: number 1 and some
+other junk in it',
'string that I need to gather the common base from: number 2 and some
+other junk in it',
'string that I need to gather the common base number 4 and some other
+junk in it',
'string that I need to gather the common base from: number 3 and some
+other junk in it',
);
my( $mask ) = ( $strings[ 0 ] ^ $strings[ 1 ] ) =~ m[(^\0+)];
my $common = substr $strings[ 0 ], 0, length $mask;
for my $i ( 2 .. $#strings ) {
if( substr( $strings[ $i ], 0, length $common ) ne $common ) {
( $mask ) = ( $strings[ 0 ] ^ $strings[ $i ] ) =~ m[(^\0+)];
$common = substr $strings[ 0 ], 0, length $mask;
}
}
print "'$common'";
__END__
C:\test>1147616
'string that I need to gather the common base '
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
|
|
The m[(^\0+)] match results in an undefined value for $mask when there is no common base substring at all, which then hiccups a "Use of uninitialized..." warning in the substr expression (if warnings are enabled). Happily, this is easily fixed by using m[(^\0*)] instead!
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
|
|
| [reply] |
|
|
This solution fails for the null-infested (qq{\000\000\000}, qq{\000\000}, qq{\000}) list of strings (as does GrandFather's, but GrandFather (update: explicitly) assumes nulls will not be present in any strings). The same list reversed produces the proper result. Some other solutions seem to accept nulls happily.
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
|
|
This solution fails for the null-infested (qq{\000\000\000}, qq{\000\000}, qq{\000}) list of strings
And what you got from the OPs sample data is that his data is infested with nulls?
but GrandFather assumes nulls will not be present in any strings
And mine doesn't?
Oh. I see. He states that he's assuming it rather than leaving the bloody obvious to be bloody obvious. (Whatever did I do to rattle your cage recently?)
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
|
|
|
|
| [reply] [d/l] |
|
|
| [reply] |
|
|
| [reply] |
Re: An efficient way to gather a common portion of several strings' beginnings
by GrandFather (Saint) on Nov 15, 2015 at 11:34 UTC
|
use strict;
use warnings;
my @strings = (
'string that I need to gather the common base from: number 1 and s
+ome other junk in it',
'string that I need to gather the common base from: number 2 and s
+ome other junk in it',
'string that I need to gather the common base number 4 and some ot
+her junk in it',
'string that I need to gather the common base from: number 3 and s
+ome other junk in it',
);
my $common = $strings[0];
for my $str (@strings[1 .. $#strings]) {
($common ^ $str) =~ m/^\0*/;
$common = substr $str, 0, $+[0] if $+[0] < length $common;
}
print "'$common'";
Prints:
'string that I need to gather the common base '
The xor operator ('^') combines the strings byte by byte and generates a null for each identical byte pair. @+ contains the offsets of the ends of matches. In this case the entire match is just the ticket so we use the first entry, 0, which is effectively the length of the common base.
<Update: Fixed match issue pointed out by AnomalousMonk.
Premature optimization is the root of all job security
| [reply] [d/l] [select] |
|
|
| [reply] [d/l] [select] |
Re: An efficient way to gather a common portion of several strings' beginnings
by atcroft (Abbot) on Nov 15, 2015 at 08:22 UTC
|
I would suggest starting by ordering your array by string length, shortest to longest. Then, take element 0 as your first approximation at the shortest substring and compare it to the same length section (via substr()) to the next string. If they do not match, reduce the test sequence until you arrive at a match, or an empty string. Repeat until you have examined all strings, or have an empty approximation string.
Untested code example:
my @string
= sort { length $a <=> length $b }
( qw/ quux asdfasdfasdf asdfasdf asdfzxcv as / );
my $common = shift @string;
while ( my $teststr = shift @string and length $common ) {
if ( $common eq substr( $teststr, 0, length $common ) ) {
next;
}
my $flag = length $common;
while ( $flag ) {
if ( $common eq substr( $teststr, 0, $flag ) ) {
$flag = 0;
}
else {
$flag--;
$common = substr( $common, 0, $flag );
}
}
}
# print $common;
Hope that helps.
Update: 2015-11-15
Fixed errors in code sample. (My thanks to oiskuu for pointing their existence!) | [reply] [d/l] |
Re: An efficient way to gather a common portion of several strings' beginnings
by hippo (Archbishop) on Nov 15, 2015 at 13:11 UTC
|
If you first sort the array lexically all you need to do is compare the first and last entries. I've not benched it but would expect this to be most efficient for any large array.
| [reply] |
Re: An efficient way to gather a common portion of several strings' beginnings
by LanX (Saint) on Nov 15, 2015 at 20:11 UTC
|
TIMTOWTDI
personally I much prefer hippos approach with lexical sorting!
FWIW here a solution with a regex searching through a concatenation of all strings for the longest repeated pattern:
use strict;
use warnings;
my @strings = (
'string that I need to gather the common base from: number 1 and s
+ome other junk in it',
'string that I need to gather the common base from: number 2 and s
+ome other junk in it',
'string that I need to gather the common base from: number 3 and s
+ome other junk in it',
);
my $sep = "\0";
my $nosep = "[^$sep]";
my $all = join $sep, @strings;
$all =~ /^ ($nosep*) $nosep*? ( $sep \1 $nosep*? )+ $/x;
print "solution: '$1'";
OUTPUT:
solution: 'string that I need to gather the common base from: number '
Please note that you need to take care that the separator (here \0 ) is not part of any string.
You could easily check by counting $sep in $all.
| [reply] [d/l] [select] |
|
|
Lexical sort is n*log(n)*m (character comparisons) vs n*m of the basic loop. It is certainly not optimal for arbitrary inputs.
Update. Some benchmarking shows either of the two following versions ought to perform adequately, in practice. I'll leave it to the reader's discretion...
# however, these assume strings don't contain \0
sub lcp_v4 {
my ($e, $s) = ("", @_);
$e |= $_ ^ $s for @_;
$e =~ m/^\0*/;
substr($s, 0, $+[0]);
}
sub lcp_v5 {
my ($a, $b) = (sort @_)[0,-1];
($a ^ $b) =~ m/^\0*/;
substr($a, 0, $+[0]);
}
use List::Util qw(minstr maxstr);
sub lcp_v6 {
my ($a, $b) = (&minstr, &maxstr);
($a ^ $b) =~ m/^\0*/;
substr($a, 0, $+[0]);
}
Update 2. Sometimes one is the most oblivious to the most obvious. Added the third version lcp_v6.
| [reply] [d/l] [select] |
|
|
| [reply] |
|
|
Maybe of interest:
I tried to beat sort, where much of the result is useless, cause we only need the first and last element and nothing in between.
The results are not too spectacular, though its requires less memory:
use warnings;
use strict;
use Time::HiRes qw/time/;
my $common = join "","a".."z";
my @x = map { $common . rand 1 } 1..1e6;
my (@a,$a,$b,$start);
print "\n--- with full sort\n";
@a=@x;
$start=time();
my @b= sort @a;
print $b[0],"\n",$b[-1],"\n";
print time -$start,"\n";
print "\n--- with triple sort\n";
@a=@x;
$start=time();
$a= shift @a;
$b= shift @a;
($a,undef,$b) = sort($a,$_,$b) for @a;
print $a,"\n",$b,"\n";
print time -$start,"\n";
print "\n--- with assignment\n";
@a=@x;
$start=time();
$a= shift @a;
$b= shift @a;
for my $x (@a) {
$a = ($x,$x,$a)[$a cmp $x];
#next if $a eq $x;
$b = ($x,$b,$x)[$b cmp $x];
}
print $a,"\n",$b,"\n";
print time -$start,"\n";
print "\n--- with goto\n";
@a=@x;
$start=time();
$a= shift @a;
$b= shift @a;
for my $x (@a) {
goto ("NEXT", "NEWMIN", "MAYBEMAX")[$a cmp $x];
NEWMIN:
$a=$x;
next;
MAYBEMAX:
goto ("NEXT", "NEXT", "NEWMAX" )[$b cmp $x];
NEWMAX:
$b=$x;
NEXT:
}
print $a,"\n",$b,"\n";
print time -$start,"\n";
output:
--- with full sort
abcdefghijklmnopqrstuvwxyz0.000100396248079448
abcdefghijklmnopqrstuvwxyz9.99150346814304e-06
7.66956496238708
--- with triple sort
abcdefghijklmnopqrstuvwxyz0.000100396248079448
abcdefghijklmnopqrstuvwxyz9.99150346814304e-06
5.88848209381104
--- with assignment
abcdefghijklmnopqrstuvwxyz0.000100396248079448
abcdefghijklmnopqrstuvwxyz9.99150346814304e-06
1.2906858921051
--- with goto
abcdefghijklmnopqrstuvwxyz0.000100396248079448
abcdefghijklmnopqrstuvwxyz9.99150346814304e-06
2.99558115005493
update
either I'm working too hard or Alzheimer is knocking at the door.
I completely forgot about le and ge and spend too much effort into emulation
print "\n--- with le/ge\n";
@a=@x;
$start=time();
$a= shift @a;
$b= shift @a;
for my $x (@a) {
$a = $x if $a ge $x;
$b = $x if $b le $x;
}
print $a,"\n",$b,"\n";
print time -$start,"\n";
__END__
--- with le/ge
abcdefghijklmnopqrstuvwxyz0.000100420329498974
abcdefghijklmnopqrstuvwxyz9.95282924378671e-05
0.773550033569336
| [reply] [d/l] [select] |
|
|
DB<129> push @a, join "",a..z,rand(1) for 1..1e6
=> ""
DB<130> $start=time;print lcp_v4(@a);time -$start
=> 1.25800704956055
abcdefghijklmnopqrstuvwxyz
DB<131> $start=time;print lcp_v5(@a);time -$start
=> 6.93630909919739
abcdefghijklmnopqrstuvwxyz
> # however, these assume strings don't end with \0
See thats the benefit of a simple algorithm, sort doesn't care about \0, one could easily replace the comparison of 2 lines with something stable without notable impact.
| [reply] [d/l] |
|
|
--- with le/ge
abcdefghijklmnopqrstuvwxyz0.000100071512918021
abcdefghijklmnopqrstuvwxyz9.83499014282074e-05
0.786839962005615
--- with minstr/maxstr
abcdefghijklmnopqrstuvwxyz0.000100071512918021
abcdefghijklmnopqrstuvwxyz9.83499014282074e-05
0.23346209526062
Though your final comparison fails with some edge cases :-P
| [reply] [d/l] |
Re: An efficient way to gather a common portion of several strings' beginnings
by AnomalousMonk (Archbishop) on Nov 15, 2015 at 18:55 UTC
|
Here's my solution (probably not the best, but at least O(n)), along with testing of all the others except atcroft's, which has many failures. (These failures may have a simple fix, but I've no time right now.)
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
|
|
You left me out of that test, I feel so unappreciated ;( haha :P
| [reply] |
Re: An efficient way to gather a common portion of several strings' beginnings
by james28909 (Deacon) on Nov 15, 2015 at 16:02 UTC
|
Using capture groups, only print (or push or add to hash) values that match.
use strict;
use warnings;
my @strings = (
'string that I need to gather the common base from: number 1 and some
+other junk in it',
'string that I need to gather the common base from: number 2 and some
+other junk in it',
'string that I need to gather the common base from: number 3 and some
+other junk in it',
'string that doesnt meet the proper specifications: number 4 and some
+other junk in it',
'another string that will fail to match: because it doesnt match DUH!'
+,
'this string will fail as well'
);
for(@strings){
print $1 . $2 . "\n" if (/^(string that I need to gather the commo
+n base from:\s)(.*)/);
}
| [reply] [d/l] |
Re: An efficient way to gather a common portion of several strings' beginnings
by oiskuu (Hermit) on Nov 15, 2015 at 20:22 UTC
|
#! /usr/bin/perl
use strict;
use warnings;
use List::Util q(first);
sub csp {
my ($n, $s) = (-1, shift // return);
while (++$n < length($s)) {
my $c = substr($s,$n,1);
last if first { substr($_,$n,1) ne $c } @_;
}
substr($s,0,$n);
}
print csp(qw(ahhaaaaa aha ahem));
But sorting the strings by length could still prove beneficial, esp. if the data is already thus structured.
| [reply] [d/l] |