Update DO NOT USE! THIS IS A TERMINALLY BROKEN ALGORITHM THAT GET THE RIGHT ANSWER SOMETIMES ALMOST BY LUCK. I DOUBT IT CAN BE EASILY FIXED.
Must test more! Must test more!
Whether this is 'cleaner' is doubtful, given this is the first time ever I've felt the need to use redo, but if your big set is very big, is should result in less total memory use and less comparisons. Ie. It should be faster, I think.
The basic premise to first locate all the subsets of @a that contain only values that exist in @b, and then use join and index to verify if the subset actually exists in @b.
#! perl -slw use strict; # gen some test data (often generates warnings too) #my @a = map{ int rand 100 } 1 .. 100; print "@a"; #my @b; push @b, @a[ ($_ = rand(100)) .. ($_ + rand(10)) ] for 1 .. 10 +; print "@b"; my @a = qw(fred bob joe jim mary elaine); my @b = qw(frank joe jim mary bob); my %b; @b{ @b } = (); my $b = join ',', @b ; my @subsets; my $start = 0; OUTER: { for( $start .. $#a ) { next if exists $b{ $a[$_] }; if( $_ > $start+1 ) { my $a = join ',', @a[ $start .. ($_-1) ]; push @subsets, [ ($_ - 1 - $start), $start, $_ -1 ] if 1+i +ndex( $b, $a ); } $start++; redo OUTER; } } my $longest = [0]; $_->[0] > $longest->[0] and $longest = $_ for @subsets; print "@a[ $longest->[1] .. $longest->[2] ]";
It finds the apppropriate subset of your sample data, and works quite quickly for various datasets I generated randomly. Full testing is left to you.
It handles duplicates in either set (collection:), and could probably be more efficient if this isn't a requirement.
Caveats:
The usual one about elements that could contain your chosen seperator which you are probably aware of.
As is, it will only find the first, longest common subset. If two or more, equally long, subsets exist and you want them all, you have a little work left to do:)
In reply to Re: Finding largest common subset in lists?
by BrowserUk
in thread Finding largest common subset in lists?
by anjiro
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |