igoryonya has asked for the wisdom of the Perl Monks concerning the following question:

I have 2 or more strings, that I need to gather a common base from the beginning of the string. Here is how I do it:
my @strings = ( 'string that I need to gather the common base from: number 1 and some +other junk in it', 'string that I need to gather the common base from: number 2 and some +other junk in it', 'string that I need to gather the common base from: number 3 and some +other junk in it' ); #After processing, I should get: #'string that I need to gather the common base from: number ' my $string = ''; my ($match_string, @rest_strings) = @strings; for my $string_idx (0..length($match_string)){ if(scalar(@rest_strings) > (grep { substr($match_string, $string_idx, + 1) eq substr($_, $string_idx, 1) } @rest_strings)){ $string = substr($match_string, 0, $string_idx); last; } }
I am wondering, if there is a better, a more efficient solution to it?
  • Comment on An efficient way to gather a common portion of several strings' beginnings
  • Download Code

Replies are listed 'Best First'.
Re: An efficient way to gather a common portion of several strings' beginnings
by BrowserUk (Patriarch) on Nov 15, 2015 at 08:16 UTC

    This is O(n) rather than the O(n2) of your solution:

    #! perl -slw use strict; my @strings = ( 'string that I need to gather the common base from: number 1 and some +other junk in it', 'string that I need to gather the common base from: number 2 and some +other junk in it', 'string that I need to gather the common base number 4 and some other +junk in it', 'string that I need to gather the common base from: number 3 and some +other junk in it', ); my( $mask ) = ( $strings[ 0 ] ^ $strings[ 1 ] ) =~ m[(^\0+)]; my $common = substr $strings[ 0 ], 0, length $mask; for my $i ( 2 .. $#strings ) { if( substr( $strings[ $i ], 0, length $common ) ne $common ) { ( $mask ) = ( $strings[ 0 ] ^ $strings[ $i ] ) =~ m[(^\0+)]; $common = substr $strings[ 0 ], 0, length $mask; } } print "'$common'"; __END__ C:\test>1147616 'string that I need to gather the common base '

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.

      The  m[(^\0+)] match results in an undefined value for  $mask when there is no common base substring at all, which then hiccups a "Use of uninitialized..." warning in the substr expression (if warnings are enabled). Happily, this is easily fixed by using  m[(^\0*)] instead!


      Give a man a fish:  <%-{-{-{-<

        Good catch and nice solution; but if the OP is expecting there to be a common prefix to his lines, then that warning might be a godsend in the event of a dataset where that is not the case.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
        In the absence of evidence, opinion is indistinguishable from prejudice.

      This solution fails for the null-infested  (qq{\000\000\000}, qq{\000\000}, qq{\000}) list of strings (as does GrandFather's, but GrandFather (update: explicitly) assumes nulls will not be present in any strings). The same list reversed produces the proper result. Some other solutions seem to accept nulls happily.


      Give a man a fish:  <%-{-{-{-<

        This solution fails for the null-infested (qq{\000\000\000}, qq{\000\000}, qq{\000}) list of strings

        And what you got from the OPs sample data is that his data is infested with nulls?

        but GrandFather assumes nulls will not be present in any strings

        And mine doesn't?

        Oh. I see. He states that he's assuming it rather than leaving the bloody obvious to be bloody obvious. (Whatever did I do to rattle your cage recently?)


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
        In the absence of evidence, opinion is indistinguishable from prejudice.

      Speaking of algorithmic efficiency. Consider the case where $string[17] eq "x" and preceding values all span kilobytes. Clearly, there's some room for improvement.

        Speaking of algorithmic efficiency. Consider the case where $string17 eq "x" and preceding values all span kilobytes. Clearly, there's some room for improvement.

        Hm. You could sort the strings, but sorting is O(n log n) instead of O(n).

        Still think so?


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
        In the absence of evidence, opinion is indistinguishable from prejudice.

      ++ Superb solution /u/BrowserUk. It took me some time to figure out. Cheers.

Re: An efficient way to gather a common portion of several strings' beginnings
by GrandFather (Saint) on Nov 15, 2015 at 11:34 UTC

    Assuming none of your strings contain nulls:

    use strict; use warnings; my @strings = ( 'string that I need to gather the common base from: number 1 and s +ome other junk in it', 'string that I need to gather the common base from: number 2 and s +ome other junk in it', 'string that I need to gather the common base number 4 and some ot +her junk in it', 'string that I need to gather the common base from: number 3 and s +ome other junk in it', ); my $common = $strings[0]; for my $str (@strings[1 .. $#strings]) { ($common ^ $str) =~ m/^\0*/; $common = substr $str, 0, $+[0] if $+[0] < length $common; } print "'$common'";

    Prints:

    'string that I need to gather the common base '

    The xor operator ('^') combines the strings byte by byte and generates a null for each identical byte pair. @+ contains the offsets of the ends of matches. In this case the entire match is just the ticket so we use the first entry, 0, which is effectively the length of the common base.

    <Update: Fixed match issue pointed out by AnomalousMonk.

    Premature optimization is the root of all job security
      ($common ^ $str) =~ m/^\0+/;

      Please see reply above regarding  m/^\0+/ versus  m/^\0*/ match.


      Give a man a fish:  <%-{-{-{-<

Re: An efficient way to gather a common portion of several strings' beginnings
by atcroft (Abbot) on Nov 15, 2015 at 08:22 UTC

    I would suggest starting by ordering your array by string length, shortest to longest. Then, take element 0 as your first approximation at the shortest substring and compare it to the same length section (via substr()) to the next string. If they do not match, reduce the test sequence until you arrive at a match, or an empty string. Repeat until you have examined all strings, or have an empty approximation string.

    Untested code example:

    my @string = sort { length $a <=> length $b } ( qw/ quux asdfasdfasdf asdfasdf asdfzxcv as / ); my $common = shift @string; while ( my $teststr = shift @string and length $common ) { if ( $common eq substr( $teststr, 0, length $common ) ) { next; } my $flag = length $common; while ( $flag ) { if ( $common eq substr( $teststr, 0, $flag ) ) { $flag = 0; } else { $flag--; $common = substr( $common, 0, $flag ); } } } # print $common;

    Hope that helps.

    Update: 2015-11-15
    Fixed errors in code sample. (My thanks to oiskuu for pointing their existence!)

Re: An efficient way to gather a common portion of several strings' beginnings
by hippo (Archbishop) on Nov 15, 2015 at 13:11 UTC

    If you first sort the array lexically all you need to do is compare the first and last entries. I've not benched it but would expect this to be most efficient for any large array.

Re: An efficient way to gather a common portion of several strings' beginnings
by LanX (Saint) on Nov 15, 2015 at 20:11 UTC
    TIMTOWTDI

    personally I much prefer hippos approach with lexical sorting!

    FWIW here a solution with a regex searching through a concatenation of all strings for the longest repeated pattern:

    use strict; use warnings; my @strings = ( 'string that I need to gather the common base from: number 1 and s +ome other junk in it', 'string that I need to gather the common base from: number 2 and s +ome other junk in it', 'string that I need to gather the common base from: number 3 and s +ome other junk in it', ); my $sep = "\0"; my $nosep = "[^$sep]"; my $all = join $sep, @strings; $all =~ /^ ($nosep*) $nosep*? ( $sep \1 $nosep*? )+ $/x; print "solution: '$1'";

    OUTPUT:

    solution: 'string that I need to gather the common base from: number '

    Please note that you need to take care that the separator (here \0 ) is not part of any string.

    You could easily check by counting $sep in $all.

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

      Lexical sort is n*log(n)*m (character comparisons) vs n*m of the basic loop. It is certainly not optimal for arbitrary inputs.

      Update. Some benchmarking shows either of the two following versions ought to perform adequately, in practice. I'll leave it to the reader's discretion...

      # however, these assume strings don't contain \0 sub lcp_v4 { my ($e, $s) = ("", @_); $e |= $_ ^ $s for @_; $e =~ m/^\0*/; substr($s, 0, $+[0]); } sub lcp_v5 { my ($a, $b) = (sort @_)[0,-1]; ($a ^ $b) =~ m/^\0*/; substr($a, 0, $+[0]); } use List::Util qw(minstr maxstr); sub lcp_v6 { my ($a, $b) = (&minstr, &maxstr); ($a ^ $b) =~ m/^\0*/; substr($a, 0, $+[0]); }

      Update 2. Sometimes one is the most oblivious to the most obvious. Added the third version lcp_v6.

        True Maybe true but

        A) simple is beautiful :)

        B) arbitrary input where performance matters ? This won't have any meaningful substrings other than ""

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

        Maybe of interest:

        I tried to beat sort, where much of the result is useless, cause we only need the first and last element and nothing in between.

        The results are not too spectacular, though its requires less memory:

        use warnings; use strict; use Time::HiRes qw/time/; my $common = join "","a".."z"; my @x = map { $common . rand 1 } 1..1e6; my (@a,$a,$b,$start); print "\n--- with full sort\n"; @a=@x; $start=time(); my @b= sort @a; print $b[0],"\n",$b[-1],"\n"; print time -$start,"\n"; print "\n--- with triple sort\n"; @a=@x; $start=time(); $a= shift @a; $b= shift @a; ($a,undef,$b) = sort($a,$_,$b) for @a; print $a,"\n",$b,"\n"; print time -$start,"\n"; print "\n--- with assignment\n"; @a=@x; $start=time(); $a= shift @a; $b= shift @a; for my $x (@a) { $a = ($x,$x,$a)[$a cmp $x]; #next if $a eq $x; $b = ($x,$b,$x)[$b cmp $x]; } print $a,"\n",$b,"\n"; print time -$start,"\n"; print "\n--- with goto\n"; @a=@x; $start=time(); $a= shift @a; $b= shift @a; for my $x (@a) { goto ("NEXT", "NEWMIN", "MAYBEMAX")[$a cmp $x]; NEWMIN: $a=$x; next; MAYBEMAX: goto ("NEXT", "NEXT", "NEWMAX" )[$b cmp $x]; NEWMAX: $b=$x; NEXT: } print $a,"\n",$b,"\n"; print time -$start,"\n";

        output:

        --- with full sort abcdefghijklmnopqrstuvwxyz0.000100396248079448 abcdefghijklmnopqrstuvwxyz9.99150346814304e-06 7.66956496238708 --- with triple sort abcdefghijklmnopqrstuvwxyz0.000100396248079448 abcdefghijklmnopqrstuvwxyz9.99150346814304e-06 5.88848209381104 --- with assignment abcdefghijklmnopqrstuvwxyz0.000100396248079448 abcdefghijklmnopqrstuvwxyz9.99150346814304e-06 1.2906858921051 --- with goto abcdefghijklmnopqrstuvwxyz0.000100396248079448 abcdefghijklmnopqrstuvwxyz9.99150346814304e-06 2.99558115005493

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

        update

        either I'm working too hard or Alzheimer is knocking at the door.

        I completely forgot about le and ge and spend too much effort into emulation

        print "\n--- with le/ge\n"; @a=@x; $start=time(); $a= shift @a; $b= shift @a; for my $x (@a) { $a = $x if $a ge $x; $b = $x if $b le $x; } print $a,"\n",$b,"\n"; print time -$start,"\n"; __END__ --- with le/ge abcdefghijklmnopqrstuvwxyz0.000100420329498974 abcdefghijklmnopqrstuvwxyz9.95282924378671e-05 0.773550033569336
        > Update. Some benchmarking ...

        your version is of course faster

        DB<129> push @a, join "",a..z,rand(1) for 1..1e6 => "" DB<130> $start=time;print lcp_v4(@a);time -$start => 1.25800704956055 abcdefghijklmnopqrstuvwxyz DB<131> $start=time;print lcp_v5(@a);time -$start => 6.93630909919739 abcdefghijklmnopqrstuvwxyz

        > # however, these assume strings don't end with \0

        See thats the benefit of a simple algorithm, sort doesn't care about \0, one could easily replace the comparison of 2 lines with something stable without notable impact.

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

        > Update 2. Sometimes one is the most oblivious to the most obvious. Added the third version lcp_v6.

        I could call using modules cheating, but

        a) it's core

        b) you have to admit that hippo's sort approach is sexy!

        --- with le/ge abcdefghijklmnopqrstuvwxyz0.000100071512918021 abcdefghijklmnopqrstuvwxyz9.83499014282074e-05 0.786839962005615 --- with minstr/maxstr abcdefghijklmnopqrstuvwxyz0.000100071512918021 abcdefghijklmnopqrstuvwxyz9.83499014282074e-05 0.23346209526062

        Though your final comparison fails with some edge cases :-P

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

Re: An efficient way to gather a common portion of several strings' beginnings
by AnomalousMonk (Archbishop) on Nov 15, 2015 at 18:55 UTC

    Here's my solution (probably not the best, but at least O(n)), along with testing of all the others except atcroft's, which has many failures. (These failures may have a simple fix, but I've no time right now.)


    Give a man a fish:  <%-{-{-{-<

      You left me out of that test, I feel so unappreciated ;( haha :P
Re: An efficient way to gather a common portion of several strings' beginnings
by james28909 (Deacon) on Nov 15, 2015 at 16:02 UTC
    Using capture groups, only print (or push or add to hash) values that match.
    use strict; use warnings; my @strings = ( 'string that I need to gather the common base from: number 1 and some +other junk in it', 'string that I need to gather the common base from: number 2 and some +other junk in it', 'string that I need to gather the common base from: number 3 and some +other junk in it', 'string that doesnt meet the proper specifications: number 4 and some +other junk in it', 'another string that will fail to match: because it doesnt match DUH!' +, 'this string will fail as well' ); for(@strings){ print $1 . $2 . "\n" if (/^(string that I need to gather the commo +n base from:\s)(.*)/); }
Re: An efficient way to gather a common portion of several strings' beginnings
by oiskuu (Hermit) on Nov 15, 2015 at 20:22 UTC

    Here's a version using List::Util. Pretty straight-forward.

    #! /usr/bin/perl use strict; use warnings; use List::Util q(first); sub csp { my ($n, $s) = (-1, shift // return); while (++$n < length($s)) { my $c = substr($s,$n,1); last if first { substr($_,$n,1) ne $c } @_; } substr($s,0,$n); } print csp(qw(ahhaaaaa aha ahem));

    But sorting the strings by length could still prove beneficial, esp. if the data is already thus structured.