good chemistry is complicated,and a little bit messy -LW PerlMonks

Finding largest common subset in lists?

 on Jun 05, 2003 at 07:07 UTC Need Help??

anjiro has asked for the wisdom of the Perl Monks concerning the following question:

Given two lists, such as
```@a = qw(fred bob joe jim mary elaine);
@b = qw(frank joe jim mary bob);
What's the best way to find the largest subset of @b in @a? Here I mean the largest ordered subset: (joe, jim, mary) would be correct, but (bob, joe, jim, mary) would not.

The best (read "easiest") solution I could come up with is to join(',', @a) and then join(',', @b[\$start..\$end]), and then use the index function to see if (and where) the join from @b occurs in the join from @a. There must be some perl magic to do this in a more clean, faster way.

And yes, I know that what I'm asking is of quite high computational complexity. Thanks for any suggestions!

Replies are listed 'Best First'.
Re: Finding largest common subset in lists?
by Corion (Patriarch) on Jun 05, 2003 at 07:16 UTC

The magic is in the module Algorithm::Diff, which implements the Largest Common Subsequence (LCS), which is a more accurate name of the problem you describe (a set has no order).

```perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ;    # The
\$d = new HTTP::Daemon and fork and getprint \$d->url and exit;#spider
(\$c = \$d->accept())->get_request(); \$c->send_response( new   #in the
HTTP::Response(200,\$_,\$_,qq(Just another Perl hacker\n))); ' #  web
Close but no banana, unfortunately. Quoting from the manual:
...you want to find a new sequence S which can be obtained from the first sequence by deleting some items, and from the secend sequence by deleting other items.
What I need instead is to generate a new sequence S which can be obtained from the first list by deleting some items, and from the second list by deleting other items, but only deleting items before the first element of S or after the last element of S.

For example, given this code:

```use Algorithm::Diff qw(LCS);
@seq1 = qw(a b c d f g h j q z);
@seq2 = qw(a b c d e f g i j k r x y z);
@lcs = LCS(\@seq1, \@seq2);
What is produced is the list qw(a b c d f g j z). It did this by removing letters from the "inside" of S, for example 'e' and 'h'. Instead, I'd want qw(a b c d).
This is still unclear. To which part(s) of the sentence does "but only deleting items before the first element of S or after the last element of S" refer? To the entire preceding part of the sentence, or only to the part after "deleting some items"? More to the point, given the following two sequences:
```@seq1 = qw(a b c d e f g h);
@seq2 = qw(a b c e f x g h);
is the answer  a b c  or  a b c e f ?
Re: Finding largest common subset in lists?
by broquaint (Abbot) on Jun 05, 2003 at 09:45 UTC
Update: fixed code to work with zby's case and hopefully any other case.
Update: code won't work with duplicates in the second list

Not exactly vastly mystical but this should do the trick (although it hasn't been thoroughly tested)

```use strict;

my @a = qw/ fred bob joe jim mary elaine /;
my @b = qw/ frank joe jim mary bob /;

print "LCS[anjiro] - ", join(\$", get_lcs(\@a, \@b)), \$/;

@a = qw/ a b c /;
@b = qw/ a b x c /;

print "LCS[zby]    - ", join(\$", get_lcs(\@a, \@b)), \$/;

sub get_lcs {
my @a = @{ +shift };
my @b = @{ +shift };

my %map = map { \$b[\$_] => \$_ } 0 .. \$#b;

my(@lcs, @tmp);
for(0 .. \$#a) {
next unless exists \$map{\$a[\$_]}
or \$a[\$_ + 1] eq \$b[\$map{\$a[\$_]} + 1];

push @tmp, \$a[\$_] if \$a[\$_] eq \$b[\$map{\$a[\$_]}];

if(\$a[\$_ + 1] ne \$b[\$map{\$a[\$_]} + 1]) {
@lcs = @tmp if @tmp > @lcs;
@tmp = ();
}
}
@lcs = @tmp if @tmp > @lcs;

return @lcs;
}

__output__

LCS[anjiro] - joe jim mary
LCS[zby]    - a b
You might also find some roughly applicable questions under Longest Common Substring.
HTH

_________
broquaint

With use warnings it gives a Use of uninitialized value in string eq at... warning. Small off-by-one error.

Replacing

```   push @tmp, \$a[\$_]
if ( \$a[\$_ + 1]) eq ( \$b[\$map{\$a[\$_]} + 1] )
or @tmp >= 1;
with
```    push @tmp, \$a[\$_]
if (\$a[\$_ + 1] and \$b[\$map{\$a[\$_]} + 1]
and ( \$a[\$_ + 1]) eq ( \$b[\$map{\$a[\$_]} + 1] )
or @tmp >= 1);
makes Perl happy again.

Arjen

I just felt it was too simple. For @a = qw(a b c); @b = qw(a b x c) it prints LCS - a b c.
The updated code does not work for  @a = qw(a b c d); @b = qw(a b x b c d). It prints a b c d. The output is suprising for me - I created the inputs to catch another kind of mistake. I believe you can't do it in one sweep.
The output is due to the fact that 'b' is repeated in @b so the offset in %map is for the second 'b'. I think I'll leave the code as it is for the time being and just add a caveat that it won't work when there are duplicates in @b. Thanks once again :)

_________
broquaint

Re: Finding largest common subset in lists?
by antirice (Priest) on Jun 05, 2003 at 14:30 UTC
Ok, first thing's first. What you actually want is the Largest Common Contiguous Ordered Subset. And so I give you a somewhat terrible algorithm (STA).

```sub find_flccos {
my @a = @{\$_[0]};
my @b = @{\$_[1]};
my (%map,@curchk,@longest);

foreach my \$i (0 .. \$#a) {
# I decided to store this map so we don't repeat this
# O(n) step if we come across the same letter again
\$map{\$a[\$i]} = [ grep \$b[\$_] eq \$a[\$i], (0 .. \$#b) ] unless define
+d \$map{\$a[\$i]};
# ok, these are the indices in b where we should
#start matching from
foreach my \$j (@{\$map{\$a[\$i]}}) {
@curchk = ();
# make temporary indices
my (\$ti,\$tj) = (\$i,\$j);
# fill @curchk with the longest current match
while (\$ti < @a && \$tj < @b && \$a[\$ti] eq \$b[\$tj]) {
push @curchk, \$a[\$ti];
\$ti++;
\$tj++;
}
# change the longest array if it is longer
#than the one found previously
@longest = @curchk if (\$#curchk > \$#longest);
}
}
return @longest;
}

__DATA__

my (@a,@b,@c,@d,@e,@f);
@a = qw( a a a a b c d );
@b = qw( a b c b a b a c a d );

@c = qw(fred bob joe jim mary elaine);
@d = qw(frank joe jim mary bob);

@e = @f = ();

print join ",", find_flccos(\@a,\@b);
print \$/;

print join ",", find_flccos(\@d,\@c);
print \$/;

print join ",", find_flccos(\@e,\@f);
print \$/;

OUTPUT:
> perl flccos.pl
a,b,c
joe,jim,mary

>

Hope this helps.

Update: Looked like a disgusting mess before I put the comments on multiple lines. Sry ;-)

antirice
The first rule of Perl club is - use Perl
The
ith rule of Perl club is - follow rule i - 1 for i > 1

You, sir, are the winner. I can't seem to break your code. I'd just like to say, this is the best website ever!
Re: Finding largest common subset in lists?
by BrowserUk (Patriarch) on Jun 05, 2003 at 09:49 UTC

Update DO NOT USE! THIS IS A TERMINALLY BROKEN ALGORITHM THAT GET THE RIGHT ANSWER SOMETIMES ALMOST BY LUCK. I DOUBT IT CAN BE EASILY FIXED.

Must test more! Must test more!

Whether this is 'cleaner' is doubtful, given this is the first time ever I've felt the need to use redo, but if your big set is very big, is should result in less total memory use and less comparisons. Ie. It should be faster, I think.

The basic premise to first locate all the subsets of @a that contain only values that exist in @b, and then use join and index to verify if the subset actually exists in @b.

```#! perl -slw
use strict;

# gen some test data (often generates warnings too)
#my @a = map{ int rand 100 } 1 .. 100; print "@a";
#my @b; push @b, @a[ (\$_ = rand(100)) .. (\$_ + rand(10)) ] for 1 .. 10
+; print "@b";

my @a = qw(fred bob joe jim mary elaine);
my @b = qw(frank joe jim mary bob);

my %b; @b{ @b } = ();
my \$b = join ',', @b ;

my @subsets;
my \$start = 0;

OUTER: {
for( \$start .. \$#a ) {
next if exists \$b{ \$a[\$_] };
if( \$_ > \$start+1 ) {
my \$a = join ',', @a[ \$start .. (\$_-1) ];
push @subsets, [ (\$_ - 1 - \$start), \$start, \$_ -1 ] if 1+i
+ndex( \$b, \$a );
}
\$start++;
redo OUTER;
}
}

my \$longest = [0];
\$_->[0] > \$longest->[0] and \$longest = \$_ for @subsets;

print "@a[ \$longest->[1] .. \$longest->[2] ]";

It finds the apppropriate subset of your sample data, and works quite quickly for various datasets I generated randomly. Full testing is left to you.

It handles duplicates in either set (collection:), and could probably be more efficient if this isn't a requirement.

Caveats:

The usual one about elements that could contain your chosen seperator which you are probably aware of.

As is, it will only find the first, longest common subset. If two or more, equally long, subsets exist and you want them all, you have a little work left to do:)

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

Re: Finding largest common subset in lists?
by gjb (Vicar) on Jun 05, 2003 at 09:52 UTC

A somewhat non-standard approach, just for the fun of it:

```my @a = qw( fred bob joe jim mary elaine );
my @b = qw( frank joe jim mary bob );

my \$str = join(" ", @a) . "&" . join(" ", @b);

if (\$str =~ /(?:\b\w+\b\s*)+?
((?:\b\w+\b\s*)+)
(?:\s*\b\w+\b\s*)+?
&
(?:\b\w+\b\s*)+?
\1
(?:\s*\b\w+\b)+?/x) {
my \$result = \$1;
\$result =~ s/\s*\$//;
print "found '\$result'\n";
}

Note: it works on the example given, I didn't test it extensively.

Just my 2 cents, -gjb-

It won't work on general data. It will find the first subsequence of @a that is in @b and be content with it. But it was an interesting attempt.

Update: For @a = qw(a b c); @b = qw(a b c) it prints b. For two element lists it does not work at all.

Update: Changed the code to:

```my \$str = join(" ", @a) . "&" . join(" ", @b);

if (\$str =~ /(?:\b\w+\b\s*)*?
((?:\b\w+\b\s*)+)
(?:\s*\b\w+\b\s*)*?
&
(?:\b\w+\b\s*)*?
\1
(?:\s*\b\w+\b)*?/x) {
my \$result = \$1;
\$result =~ s/\s*\$//;
print "found '\$result'\n";
}
It does not have the problem with cutting the first and last elements of the list, but still for @a = qw(a b c); @b = qw(a x b c) it prints a.

I also went the route of creating a regular expression, and here is what I came up with. This RE most likely has abysmal exponential performance, as it backtracks extensively. It also does not work with duplicate elements in the input stream, as a (Perl) regular expression will be content with the leftmost match - had Perl a POSIX RE engine, that engine should match the longest match from what I remember...

Anyway, the code "simply" constructs a regular expression of one list that matches all possible subsequences in order, and then matches that list against the other list as a string. Spaces are used as the delimiters between the elements.

```#!/usr/bin/perl -w
use strict;
use Test::More tests => 3;

sub lcos {
my (\$list1,\$list2) = @_;
my @list1 = @\$list1;
my @list2 = @\$list2;

# Assume that neither list1 nor list2 contain elements
# that stringify to the same value, that is, neither
# list1 nor list2 contain both "" and undef.
# Also, a blank is chosen as a delimiting character which
# shall appear in no element of list1 or list2

\$list2 = " " . join(" ", @list2) . " ";
my \$matcher = lcos_match(@list1);
#print \$matcher,"\n";
my @result;
@result = split " ", \$1
if \$list2 =~ \$matcher;
@result;
};

sub lcos_match {
my \$match = '(';
while (@_) {
\$match .= match_sequence(@_);
shift;
@_ and \$match .= "|";
};
\$match .= ')';
\$match;
};

sub match_sequence {
my \$result = "";
my \$element;
while (defined (\$element = pop)) {
\$result = \$result ? qr{ \Q\$element\E\$result?} : qr{ \Q\$element\E }
+;
};
\$result;
};

while (<DATA>) {
my (\$l1,\$l2,\$expected) = split /\s*\|\s*/;
my @l1 = split /\s*/, \$l1;
my @l2 = split /\s*/, \$l2;
my @expected = split /\s*/, \$expected;

\$" = ",";
\$, = ",";
is_deeply( [ lcos(\@l1,\@l2)], \@expected, "@l1 | @l2");
};

__DATA__
a b c d f g h j q z | a b c d e f g i j k r x y z | a b c d
a b c | a b x c | a b
a b c d | a b x b c d | b c d
Re: Finding largest common subset in lists?
by Jenda (Abbot) on Jun 05, 2003 at 15:07 UTC
This seems to work:
```use strict;

my @a = qw( fred bob joe jim mary elaine );
my @b = qw( frank joe jim mary bob );
my @result;

while (@a and @a > @result) {
my \$start = 0;
while (@b - \$start > @result) {
my \$end = 0;
my @maybe = ();
\$end++ while (exists(\$a[\$end]) and \$a[\$end] eq \$b[\$start+\$end]
+);
if (--\$end > @result) {
@result = @a[0..\$end];
}
\$start++;
}
shift(@a);
}

print "@result\n";
Note that I am destroying @a !

Jenda
Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.
-- Rick Osborne

Edit by castaway: Closed small tag in signature

So close! It works almost all the time, but this test case fails:
```my @a = qw( a b c d e f g h i j k l m n o p q r s t u v w x y z );
my @b = qw( a b c X f g h X l m n X j k a b c d );
Running on this, I get a b c as the result instead of a b c d. I'm looking to see if I can figure out why now, but I figured I'd post in case someone's faster than me (likely).

I'm stupid, stupid, stupid. There should be:

```....
if (\$end > @result) {
@result = @a[0..\$end-1];
}
...
```...
if (--\$end > @result) {
@result = @a[0..\$end];
}
...
The way I have it I'd only update the @result if the newly found list was longer by at least 2 items. The usual off-by-one error :-(

Jenda
Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.
-- Rick Osborne

Edit by castaway: Closed small tag in signature

Also if the lists are huge and a lot of the items only appear on one of the lists it may be better to filter out those "unique" elements before you start looking for matches. Of course you can't remove them completely, you have to keep a "marker" there that's not equal to any other element nor marker.

```use strict;
my @a = qw(fred bob joe jim ethfgh mary elaine foo bar bob foo too);
my @b = qw(frank joe jim dfkjhgdkjfg mary bob srere dfhgerg wet sdfwer
+ mary);

my (%seenA, %seenB);
{
my \$i = 0;
@seenA{@a}=undef;
@seenB{@b}=undef;
my \$last = -1;
@a = map {exists \$seenB{\$_} ? (\$last = \$_) : (defined(\$last) ? (\$l
+ast = undef) : ())} @a;
my \$last = -1;
@b = map {exists \$seenA{\$_} ? (\$last = \$_) : (defined(\$last) ? (\$l
+ast = undef) : ())} @b;
}

print "@a\n";
print "@b\n\n";

shift(@a) unless defined \$a[0];
shift(@b) unless defined \$b[0];
pop(@a) unless defined \$a[-1];
pop(@b) unless defined \$b[-1];

print "@a\n";
print "@b\n\n";

my @result;

while (@a and @a > @result) {
my \$start = 0;
while (@b - \$start > @result) {
my \$end = 0;
my @maybe = ();
\$end++ while (defined(\$a[\$end]) and defined(\$b[\$start+\$end]) a
+nd \$a[\$end] eq \$b[\$start+\$end]);
if (--\$end > @result) {
@result = @a[0..\$end];
}
\$start++;
}
shift(@a);
}

print "@result\n";

I assume the lists did not contain any undefs!

Jenda
Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.
-- Rick Osborne

Edit by castaway: Closed small tag in signature

4:0 for anjiro (Re: Finding largest common subset in lists?)
by zby (Vicar) on Jun 05, 2003 at 12:03 UTC
Four answers and all wrong! This makes this question somehow exceptionall. Who will be the first one with a right answer?
Re: Finding largest common subset in lists?
by Enlil (Parson) on Jun 06, 2003 at 20:21 UTC
Here is my stab at it. (LCCOS takes the reference to two array for input and spit back a reference to an array)
```####################################
sub LCCOS {
####################################
my (\$array1,\$array2) =  @_;
my \$current_largest = [];
my \$current_size;

if ( @\$array1 > @\$array2 ) {
(\$array1,\$array2) = (\$array2,\$array1)
};

my \$count = 0;

my %elements_in_longer;

#create an HoA holding the locations of all elements
#in the larger array so we only start at those places
push @{\$elements_in_longer{\$_}},
\$count++ for @\$array1;

for my \$pos_in_small ( 0 .. \$#array2 ) {
my \$current_letter = \$array2->[\$pos_in_small];

for my \$possible_start(@{\$elements_in_longer{\$current_letter}}) {
my \$curr_pos_in_small = \$pos_in_small;
my \$curr_pos_in_large = \$possible_start;
my @curr_matching_sequence;

while ( defined \$array1->[\$curr_pos_in_large]
and
defined \$array2->[\$curr_pos_in_small] ) {
if (\$array1->[\$curr_pos_in_large]
eq
\$array2->[\$curr_pos_in_small]) {
push @curr_matching_sequence,
\$array2->[\$curr_pos_in_small];

\$curr_pos_in_large++;
\$curr_pos_in_small++;
} else  { last; }
}
if ( @curr_matching_sequence > @{\$current_largest} ) {
\$current_largest = \@curr_matching_sequence;
\$current_size = @curr_matching_sequence;
}
}

#if the \$current_size is longer that what is left to
#check in the smaller array, then we are done.
last if  \$current_size > \$#array2 - \$pos_in_small;
}
return \$current_largest;
}   ##LCCOS

-enlil

Re: Finding largest common subset in lists?
by uneson (Initiate) on Mar 31, 2005 at 09:17 UTC
Two years behind the rest of the class, I needed to solve a very similar problem. It is in fact not so computationally complex -- with dynamic programming, it seems it can be solved in O(length(@arr1)).

The algorithm below is a bit like edit distance for strings +"edit distance" +"dynamic programming" with scores instead of costs and maxima searched for at any point.

```#!/usr/bin/perl -w
use strict;
my @arr1 = qw( 123 a4 b c david e f g 8 h f g 8 X i j k l george );
my @arr2 = qw( c ¤¤ a4 b c d e f g 8 X l m k a4 b c david george );

print "array1: @arr1\n";
print "array2: @arr2\n";
my(\$longest, \$longest_inds) = find_lccs(\@arr1, \@arr2);
print
scalar @\$longest_inds,
" instance(s) of subsequence of length \$longest, starting at index
+ ",
(join ', ' => @\$longest_inds), ":\n";

foreach (@\$longest_inds) {
print "@arr1[\$_ .. \$_ + \$longest - 1]\n";
}

sub find_lccs {
my (\$arr1, \$arr2) = @_;
my %inds;
my \$longest = 0;
my @longest_inds = ();

for (my \$i = 0; \$i < @\$arr1; \$i++) {
\$inds{\$arr1->[\$i]}->{\$i} = 0;
}

for (my \$i = \$#\$arr2; \$i >= 0; \$i--) {
foreach (keys %{ \$inds{\$arr2->[\$i]} } ) {
if (defined \$arr2->[\$i+1] && exists \$inds{\$arr2->[\$i+1]}->
+{\$_ + 1}) {
\$inds{\$arr2->[\$i]}->{\$_} = \$inds{\$arr2->[\$i+1]}->{\$_ +
+ 1} + 1;
} else {
\$inds{\$arr2->[\$i]}->{\$_} =  1;
}
if (\$inds{\$arr2->[\$i]}->{\$_} > \$longest ) {
\$longest = \$inds{\$arr2->[\$i]}->{\$_};
@longest_inds = (\$_);
} elsif (\$inds{\$arr2->[\$i]}->{\$_} == \$longest ) {
push @longest_inds, \$_;
}
}
}

@longest_inds = sort {\$a <=> \$b} @longest_inds;
return (\$longest, \@longest_inds);
}

Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://263238]
Approved by EvdB
Front-paged by broquaint
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others studying the Monastery: (5)
As of 2023-02-07 08:30 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?