comment on

Update DO NOT USE! THIS IS A TERMINALLY BROKEN ALGORITHM THAT GET THE RIGHT ANSWER SOMETIMES ALMOST BY LUCK. I DOUBT IT CAN BE EASILY FIXED.

Must test more! Must test more!

Whether this is 'cleaner' is doubtful, given this is the first time ever I've felt the need to use redo, but if your big set is very big, is should result in less total memory use and less comparisons. Ie. It should be faster, I think.

The basic premise to first locate all the subsets of @a that contain only values that exist in @b, and then use join and index to verify if the subset actually exists in @b.

#! perl -slw
use strict;

# gen some test data (often generates warnings too)
#my @a = map{ int rand 100 } 1 .. 100; print "@a";
#my @b; push @b, @a[ ($_ = rand(100)) .. ($_ + rand(10)) ] for 1 .. 10
+; print "@b";

my @a = qw(fred bob joe jim mary elaine);
my @b = qw(frank joe jim mary bob);

my %b; @b{ @b } = ();
my $b = join ',', @b ;

my @subsets;
my $start = 0;

OUTER: {
    for( $start .. $#a ) {
        next if exists $b{ $a[$_] };
        if( $_ > $start+1 ) {
            my $a = join ',', @a[ $start .. ($_-1) ];
            push @subsets, [ ($_ - 1 - $start), $start, $_ -1 ] if 1+i
+ndex( $b, $a );
        }
        $start++;
        redo OUTER;
    }
}

my $longest = [0];
$_->[0] > $longest->[0] and $longest = $_ for @subsets;


print "@a[ $longest->[1] .. $longest->[2] ]";
[download]

It finds the apppropriate subset of your sample data, and works quite quickly for various datasets I generated randomly. Full testing is left to you.

It handles duplicates in either set (collection:), and could probably be more efficient if this isn't a requirement.

Caveats:

The usual one about elements that could contain your chosen seperator which you are probably aware of.

As is, it will only find the first, longest common subset. If two or more, equally long, subsets exist and you want them all, you have a little work left to do:)

Examine what is said, not who speaks.

"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

In reply to Re: Finding largest common subset in lists? by BrowserUk
in thread Finding largest common subset in lists? by anjiro

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.