johnnywang has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have an array of strings, some of which are substrings of others, I'd like to remove those "short" string, only keep the longest. The following simple-minded implementation gives an n^2 algorithm. I'm wondering whether there is something faster, like n*log(n). Thanks.
use strict; my @data = qw/a mnk ab m b bc abcd cd bcd bd m nk /; my @result = (); foreach $a(@data){ my $is_sub = 0; foreach $b(@data){ $b =~ /$a/i and length($b) != length($a) and $is_sub = 1 and las +t; } push @result,$a unless $is_sub; } print join(",",@result); # output is: mnk,abcd,bd

Replies are listed 'Best First'.
Re: Seeking the longest string in an array
by ikegami (Patriarch) on Dec 11, 2004 at 22:38 UTC

    Update: Fixed. As Jenda observed, my original solution solved a problem other than the one presented.

    O(N*log(N) + N*(M + M3) + M)
    = O(N*log(N) + N*M3)
    N = number of words.
    M = length of words.

    Don't let the M3 scare you. Benchmark them to find out which is best for you.

    use strict; use warnings; my @data = qw/a mnk ab m b bc abcd cd bcd bd m nk /; @data = sort { length($a) <=> length($b) } @data; my %result; foreach my $word (@data) { my $uc_word = uc($word); my $partial_word; foreach my $start_letter (1..length($uc_word)) { foreach my $num_letters ($start_letter..length($uc_word)-$start_ +letter+1) { delete($result{substr($uc_word, $start_letter-1, $num_letters +)}); } } $result{$uc_word} = $word; } print join(",", values(%result));

      I don't think your code is correct. You are only removing the items that are equal to some prefix of the currently processed $word. You should delete all substrings, not only the prefixes. That would make your solution O(N*log N + N*M*M). We don't know the expected N and M, but this makes the M much more important.

      Jenda
      We'd like to help you learn to help yourself
      Look around you, all you see are sympathetic eyes
      Stroll around the grounds until you feel at home
         -- P. Simon in Mrs. Robinson

        Shoot, you're right. I misunderstood what his program did.
Re: Seeking the longest string in an array
by Jenda (Abbot) on Dec 11, 2004 at 23:28 UTC

    Can you give us some more details? How long are your strings on average? How many do you expect to have in the list? What characters do they contain?

    This looks like you might be doing something biology related so I would not be surprised if the strings you are working for consisted of only ACGT characters or something similar.

    Also there are two problems with your code. First, you should not be using the $a and $b variables. They are kinda special (used inside the block optionaly passed to sort()) which is why "use strict" did not complain about them not being declared.

    Second, you are not excaping special characters in $a when making the regexp, what if the "words" contain some dots, question marks, asterisks, parenthesis, ... You should use $b =~ /\Q$a\E/i if you want to use the regexp to find the substring. There is actually no reason to do that, you'd better use the index() function. No need to escape anything and it should be quicker. Though if you really want your comparison to be case insensitive you'd have to uppercase (or lowercase) both strings before you try to find one in the other.

    Last thing, finding the strings length is quick. Much quicker than trying to find a substring. You should revert your condition and only try to find short strings in longer ones. A ten chars long string cannot contain a 20 char long one ;-)

    Jenda
    We'd like to help you learn to help yourself
    Look around you, all you see are sympathetic eyes
    Stroll around the grounds until you feel at home
       -- P. Simon in Mrs. Robinson

      Thanks, acutally my strings are relatively short, let's say less than 30, but the number of strings can be large (not that large, for the moment, say 10K), so only N is important, not M. It can be safely assumed that the strings don't contain those special characters.
Re: Seeking the longest string in an array
by ikegami (Patriarch) on Dec 11, 2004 at 23:13 UTC
    That's not O(N2). /$a/ involves some serious looping. Probably O(M2). So O(N2*M2) total
    N = number of words.
    M = length of word.
Re: Seeking the longest string in an array
by nedals (Deacon) on Dec 11, 2004 at 23:32 UTC
    use strict; my @data = qw/a mnk ab m b bc abcd cd bcd bd m nk/; @data = sort {length($b) <=> length($a)} @data; my @result = (); while (@data) { push @result, $data[0]; @data = grep { $data[0] !~ /\Q$_\E/ } @data; } print "@result\n";
    Update
    Changed to reflect ikegami's commets.
    Please explain the /\Q...\E/
    I'm not so sure about the /../i. The OP may want it to be case sensitive.
    Thanks.

    Update2:
    mpeters,
    Having read your post and jenda's above, I've incorporated 'index'. Your solution, as it stands, will not work because index returns a position on match and the grep expression needs to return false. So I did it this way.
    ... while (@data) { push @result, $data[0]; @data = grep { index($data[0], $_) == -1 } @data; } ...

      Change cmp to <=>.

      Change /.../ to /\Q...\E/.

      Change /.../ to /.../i.

      O(N*log(N) + N2*M2). That's a worse worst case than the OP. In practice, though, yours would be faster, because the inner loop will be get much smaller than N very fast.

        change $data[0] =~ /\Q$_\E/ to index($data[0], $_).

        Should be faster.

      The original code searched for the substrings case insensitively so you'd have to uc() or lc() the strings before passing to index() if you want to behave the same. There is also one more thing that might be worth trying. We might create a character bitmap of all strings beforehand and then only search for the substring if the longer string contains all the characters that the substring does:

      use strict; use Bit::Vector; my @data = qw/a mnk ab m b bc abcd cd bcd bd m nk/; @data = sort {length($b) <=> length($a)} @data; @data = map [$_, uc($_), Bit::Vector->new_Enum(256, join(',', map ord( +$_), (split //, uc($_))))->Block_Read()], @data; my @result = (); while (@data) { push @result, $data[0][0]; @data = grep { ($data[0][2] & $_->[2]) ne $_->[2] or index($data[0 +][1], $_->[1]) == -1 } @data; } print "@result\n";
      Whether and how much would this help depends on the strings. In this case I saved just 5 calls to index() from the original 18.

      Jenda
      We'd like to help you learn to help yourself
      Look around you, all you see are sympathetic eyes
      Stroll around the grounds until you feel at home
         -- P. Simon in Mrs. Robinson

Re: Seeking the longest string in an array
by dave_the_m (Monsignor) on Dec 12, 2004 at 01:07 UTC
    my @data = qw/a mnk ab m b bc abcd cd bcd bd m nk /; my $prev = ''; @data = grep { substr($prev, 0, length) ne $_ and ($prev = $_,1) } reverse sort @data;

    update: Whoops, I hought it was initial substrings. Thanks ikegami for pointing out my foolishosity!

    Dave.

      Doesn't work. Gives
      nk, mnk, cd, bd, bcd, abcd
      instead of
      mnk, abcd, bd
Re: Seeking the longest string in an array
by Arunbear (Prior) on Dec 11, 2004 at 23:51 UTC
    First, sort the array. Then in your outer loop, use a binary search instead of a linear scan to test if the word is a substring.
Re: Seeking the longest string in an array
by TedPride (Priest) on Dec 12, 2004 at 21:40 UTC
    The problem with these solutions is they all require nested comparisons of text. It's much better to require only one comparison per item, and also to only require searching of text that has already been validated as being unique. You do this as follows:
    use strict; use warnings; my @data = qw/a mnk ab m b bc abcd cd bcd bd m nk/; (my $result, @data) = sort {length($b) <=> length($a)} @data; for (@data) { $result .= ' '.$_ if index($result, $_) == -1; } print $result;
    I think you'll find that this method easily beats any other in both processing speed and coding efficiency.