Seeking the longest string in an array

johnnywang has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.

Re: Seeking the longest string in an array
by ikegami (Patriarch) on Dec 11, 2004 at 22:38 UTC

Update: Fixed. As Jenda observed, my original solution solved a problem other than the one presented.

O(N*log(N) + N*(M + M³) + M)
= O(N*log(N) + N*M³)
N = number of words.
M = length of words.

Don't let the M³ scare you. Benchmark them to find out which is best for you.

use strict;
use warnings;

my @data = qw/a mnk ab m b bc abcd cd bcd bd m nk /;

@data = sort { length($a) <=> length($b) } @data;

my %result;

foreach my $word (@data) {
   my $uc_word = uc($word);
   my $partial_word;

   foreach my $start_letter (1..length($uc_word)) {
      foreach my $num_letters ($start_letter..length($uc_word)-$start_
+letter+1) {
         delete($result{substr($uc_word, $start_letter-1, $num_letters
+)});
      }
   }

   $result{$uc_word} = $word;
}


print join(",", values(%result));
[download]

[reply]
[d/l]

Re^2: Seeking the longest string in an array

by Jenda (Abbot) on Dec 11, 2004 at 23:04 UTC

I don't think your code is correct. You are only removing the items that are equal to some prefix of the currently processed $word. You should delete all substrings, not only the prefixes. That would make your solution O(N*log N + N*M*M). We don't know the expected N and M, but this makes the M much more important.

Jenda
We'd like to help you learn to help yourself
Look around you, all you see are sympathetic eyes
Stroll around the grounds until you feel at home
-- P. Simon in Mrs. Robinson

[reply]

Re^3: Seeking the longest string in an array

by ikegami (Patriarch) on Dec 11, 2004 at 23:10 UTC

Shoot, you're right. I misunderstood what his program did.

[reply]

Re: Seeking the longest string in an array
by Jenda (Abbot) on Dec 11, 2004 at 23:28 UTC

Can you give us some more details? How long are your strings on average? How many do you expect to have in the list? What characters do they contain?

This looks like you might be doing something biology related so I would not be surprised if the strings you are working for consisted of only ACGT characters or something similar.

Also there are two problems with your code. First, you should not be using the $a and $b variables. They are kinda special (used inside the block optionaly passed to sort()) which is why "use strict" did not complain about them not being declared.

Second, you are not excaping special characters in $a when making the regexp, what if the "words" contain some dots, question marks, asterisks, parenthesis, ... You should use $b =~ /\Q$a\E/i if you want to use the regexp to find the substring. There is actually no reason to do that, you'd better use the index() function. No need to escape anything and it should be quicker. Though if you really want your comparison to be case insensitive you'd have to uppercase (or lowercase) both strings before you try to find one in the other.

Last thing, finding the strings length is quick. Much quicker than trying to find a substring. You should revert your condition and only try to find short strings in longer ones. A ten chars long string cannot contain a 20 char long one ;-)

Jenda
We'd like to help you learn to help yourself
Look around you, all you see are sympathetic eyes
Stroll around the grounds until you feel at home
-- P. Simon in Mrs. Robinson

[reply]
[d/l]

Re^2: Seeking the longest string in an array

by johnnywang (Priest) on Dec 12, 2004 at 06:18 UTC

Thanks, acutally my strings are relatively short, let's say less than 30, but the number of strings can be large (not that large, for the moment, say 10K), so only N is important, not M. It can be safely assumed that the strings don't contain those special characters.

[reply]

Re: Seeking the longest string in an array
by ikegami (Patriarch) on Dec 11, 2004 at 23:13 UTC

/$a/

[reply]
[d/l]

Re: Seeking the longest string in an array
by nedals (Deacon) on Dec 11, 2004 at 23:32 UTC

use strict;

my @data = qw/a mnk ab m b bc abcd cd bcd bd m nk/;
@data = sort {length($b) <=> length($a)} @data;

my @result = ();
while (@data) {
    push @result, $data[0];
    @data = grep { $data[0] !~ /\Q$_\E/  } @data;
}

print "@result\n";
[download]

Update

~~Please explain the /\Q...\E/~~

Update2:

...
while (@data) {
    push @result, $data[0];
    @data = grep { index($data[0], $_) == -1 } @data;
}
...
[download]

[reply]
[d/l]
[select]

Re^2: Seeking the longest string in an array

by ikegami (Patriarch) on Dec 11, 2004 at 23:38 UTC

Change cmp to <=>.

Change /.../ to /\Q...\E/.

Change /.../ to /.../i.

O(N*log(N) + N²*M²). That's a worse worst case than the OP. In practice, though, yours would be faster, because the inner loop will be get much smaller than N very fast.

[reply]
[d/l]
[select]

Re^3: Seeking the longest string in an array

by mpeters (Chaplain) on Dec 12, 2004 at 02:41 UTC

$data[0] =~ /\Q$_\E/

index($data[0], $_)

[reply]
[d/l]
[select]

Re^4: Seeking the longest string in an array

by ikegami (Patriarch) on Dec 12, 2004 at 05:56 UTC

Re^5: Seeking the longest string in an array

by mpeters (Chaplain) on Dec 13, 2004 at 15:22 UTC

Re^2: Seeking the longest string in an array

by Jenda (Abbot) on Dec 12, 2004 at 13:42 UTC

The original code searched for the substrings case insensitively so you'd have to uc() or lc() the strings before passing to index() if you want to behave the same. There is also one more thing that might be worth trying. We might create a character bitmap of all strings beforehand and then only search for the substring if the longer string contains all the characters that the substring does:

use strict;
use Bit::Vector;

my @data = qw/a mnk ab m b bc abcd cd bcd bd m nk/;
@data = sort {length($b) <=> length($a)} @data;

@data = map [$_, uc($_), Bit::Vector->new_Enum(256, join(',', map ord(
+$_), (split //, uc($_))))->Block_Read()], @data;

my @result = ();
while (@data) {
    push @result, $data[0][0];
    @data = grep { ($data[0][2] & $_->[2]) ne $_->[2] or index($data[0
+][1], $_->[1]) == -1 } @data;
}

print "@result\n";
[download]

Jenda
We'd like to help you learn to help yourself
Look around you, all you see are sympathetic eyes
Stroll around the grounds until you feel at home
-- P. Simon in Mrs. Robinson

[reply]
[d/l]

Re: Seeking the longest string in an array
by dave_the_m (Monsignor) on Dec 12, 2004 at 01:07 UTC

my @data = qw/a mnk ab m b bc abcd cd bcd bd m nk /; my $prev = ''; @data = grep { substr($prev, 0, length) ne $_ and ($prev = $_,1) } reverse sort @data;
[download]

update: Whoops, I hought it was initial substrings. Thanks ikegami for pointing out my foolishosity!

Dave.

[reply]
[d/l]

Re^2: Seeking the longest string in an array

by ikegami (Patriarch) on Dec 12, 2004 at 05:58 UTC

nk, mnk, cd, bd, bcd, abcd

mnk, abcd, bd

[reply]
[d/l]
[select]

Re: Seeking the longest string in an array
by Arunbear (Prior) on Dec 11, 2004 at 23:51 UTC

~~First, sort the array. Then in your outer loop, use a binary search instead of a linear scan to test if the word is a substring.~~

[reply]

Re: Seeking the longest string in an array
by TedPride (Priest) on Dec 12, 2004 at 21:40 UTC

use strict;
use warnings;

my @data = qw/a mnk ab m b bc abcd cd bcd bd m nk/;
(my $result, @data) = sort {length($b) <=> length($a)} @data;
for (@data) { $result .= ' '.$_ if index($result, $_) == -1; }
print $result;
[download]

[reply]
[d/l]