True Brute Force Longest Common Sub Sequence

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,

I have been searching far and wide for a true brute force code or algorithm to find the longest common subsequence between two strings for my research project and have not been so lucky. I have seen a few code here in Perlmonks but they are not truely brute force so it will make my measurements inaccurate, so I am turning to Perl monks for some wisdoms. I read an article on the web and it said that for two strings S1 and S2, assuming S1 is the longer string, I need to enumerate all possible subsequences in S1. Let's assume S1="abcde" and S2="abe" then to enumerate all possible subsequences of S1, what do I need to do? My first guess would be to use the split function to split S1 into individual character but then unsure of what to do next. Any helps would be appreciated.
Thanks in advance.

Also, I found the following algorithm on the web and it claims to be brute force but I am having a hard time understand what it actually do to implement it in Perl so would someone please explain it.

lcs is longest common subsequence algorithm
lcs algorithm in c
brute-force approach
/ input: array a of x/
/ input: array b of x/
/ output: pointer to beginning of longest sequence/
x * c[];
int a_counter;
int b_counter;
int c_counter;
exists bool;
int longest_sequence;
int longest_sequence_index;
/ generate array of unique elements from a called c/
for (a_counter = a_start; a_counter < size_of_a; a_counter++) {

        c_counter = c_start;
        while ( c_counter < size_of_c ) {
                if (c[c_counter] == a[a_counter] ) {
                        exists = true;
                        break;
                }
                c_counter++;
        }
        if (exists) continue;
        else c[a_counter++] = a[a_counter];
        exists = false;

}
/ generate max_c array to hold max entries of c array/
/* (could also implement d as linked list)*/
int max_c[size_of_c];
for ( c_counter = c_start; c_counter < size_of_c; c_counter++) {

max_c[c_counter] = 0;
}
/ compare a[i] with ending elements of each c[0 ... size_of_c].final/
/ elements/
for (b_counter = b_start+1; b_counter <= size_of_b; b_counter++) {

        for (c_counter = c_start; c_counter < size_of_c; c_counter++) 
+{
                if (c[c_counter][max_c[c_counter]] == b[b_counter-1]) 
+{
                        c[c_counter][max_c[c_counter]] = b[b_counter];
                        max_c[c_counter]++;
                }
        }

}
/* find longest sequence */
int longest_sequence;
for (c_counter = c_start; c_counter < size_of_c; c_counter++) {

        if (longest_sequence < max_c[c_counter]) {
                longest_sequence = max_c[c_counter];    
                longest_sequence_index = c_counter;     
        }

}
return c[longest_sequence_index];
[download]

Comment on True Brute Force Longest Common Sub Sequence Download Code

Replies are listed 'Best First'.
Re: True Brute Force Longest Common Sub Sequence by ikegami (Patriarch) on Nov 25, 2007 at 09:18 UTC
Brute force would involve the following steps: 1. Find all 2ⁿ subsequences of s1. 2. Find all 2ⁿ subsequences of s2. 3. Find all common subsequences of s1 and s2. 4. Find the longest of the common subsequences. use strict; use warnings; sub subseqs { my @subseqs = (''); while ($_[0] =~ /(.)/sg) { push @subseqs, map "$_$1", @subseqs; } return \@subseqs; } sub common { my ($list1, $list2) = @_; my %common; for my $i1 (0..$#$list1) { for my $i2 (0..$#$list2) { if ($list1->[$i1] eq $list2->[$i2]) { $common{$list1->[$i1]} = 1; } }} return keys %common; } sub longest { my $longest = shift(@_); for (@_) { if (length($_) > length($longest)) { $longest = $_; } } return $longest; } { my $s1 = 'abdef'; my $s2 = 'abcef'; my $longest = longest(common(subseqs($s1), subseqs($s2))); print("$longest\n"); } [download] Tested. Update: Added optimized `common` as an alternative. Update: Removed optimized `common`. It actually reduced the complexity as mentioned below.	[reply] [d/l] [select]
Re^2: True Brute Force Longest Common Sub Sequence by tachyon-II (Chaplain) on Nov 25, 2007 at 15:50 UTC
If you just want to use brute force then using a hash is really cheating as it is already the begining of optimisation. We need nasty embedded loops to go O(2^n*2^m). I think something like this represents to worst way to do it, first generating all possible subsequences and then testing every case. This does however seem to be the point. `use strict; sub subseqs { my $str = $_[0]; my @seqs = (''); for my $char (split //, $str) { for my $i ( 0..@seqs-1 ) { push @seqs, $seqs[$i].$char; } } return @seqs; } my $s1 = 'abcdefgh'; my $s2 = 'abecdgh'; my $best_len = 0; my $longest; for my $seq1 (subseqs($s1)) { for my $seq2 (subseqs($s2)) { next unless $seq1 eq $seq2; if (length($seq1) > $best_len) { $longest = $seq1; $best_len = length($seq1); } } } print $longest;` [download] Note that there is an edge case when 2 or more common subsequences share the same length only the first will be found.	[reply] [d/l]
Re^3: True Brute Force Longest Common Sub Sequence by ikegami (Patriarch) on Nov 25, 2007 at 18:14 UTC
If you just want to use brute force then using a hash is really cheating as it is already the begining of optimisation. ~~Have another look. That claim just isn't true.~~ ~~`common` still visits every single combination of inputs, so it's brute force. Said nasty embedded loops are there and the hash doesn't affect the complexity.~~ ~~`common` uses a hash to remove dups, something that isn't even necessary. It actually slows things down, if anything.~~ ~~Even with the optimized `common`, said nasty embedded loops are there overall and the using the optimized `common` doesn't affect the overall complexity.~~	[reply] [d/l] [select]
Re^4: True Brute Force Longest Common Sub Sequence by tachyon-II (Chaplain) on Nov 26, 2007 at 00:31 UTC
Re^5: True Brute Force Longest Common Sub Sequence by ikegami (Patriarch) on Nov 26, 2007 at 06:35 UTC
Some notes below your chosen depth have not been shown here
Re^3: True Brute Force Longest Common Sub Sequence by ikegami (Patriarch) on Nov 28, 2007 at 01:56 UTC
Your analysis doesn't take into account `$seqs[$i] . $char` and `$seq1 eq $seq2`. They are both based on N and/or M. `$seqs[$i] . $char` won't affect the final O() because it's higher up in the loop, but `$seq1 eq $seq2` is in the innermost loop. `for my $seq1 (subseqs($s1)) { for my $seq2 (subseqs($s2)) {` [download] could be written as `my @s1 = subseqs($s1); my @s2 = subseqs($s2); for my $seq1 (@s1) { for my $seq2 (@s2) {` [download] for faster code of the same complexity.	[reply] [d/l] [select]
Re: True Brute Force Longest Common Sub Sequence by tachyon-II (Chaplain) on Nov 25, 2007 at 07:41 UTC
Here is something basic in Perl to get you started. It is unclear if you mean substrings of subsequences. This deals with substrings. To understand it you will need to read up on what the substr(), push and sort functions do. `my $s1 = 'abcdefghijklmnopqrstuvwxyzabcdefgab'; my $s2 = 'abcdefgab'; my $MIN_LENGTH = 2; my @subseq = (); for my $i (0..length($s2)-$MIN_LENGTH) { for my $j ($MIN_LENGTH..length($s2)-$i) { push @subseq, substr($s2,$i,$j); } } print "$_\n" for @subseq; @subseq = sort{length($b)<=>length($a)}@subseq; SUBSEQ: for my $subseq (@subseq) { for my $i (0..length($s1)-length($subseq)) { if ($subseq eq substr($s1,$i,length($subseq))) { print "Match $subseq @ $i\n"; #last SUBSEQ } } }` [download]	[reply] [d/l]
Re^2: True Brute Force Longest Common Sub Sequence by Anonymous Monk on Nov 25, 2007 at 08:38 UTC
Thanks for the code. Why do $MIN_LENGTH = 2 instead of 1 ?	[reply]
Re: True Brute Force Longest Common Sub Sequence by KurtSchwind (Chaplain) on Nov 25, 2007 at 06:28 UTC
If I understand your question correctly, you want to know how many subsequences are in a given sequence? So for a S1="abcde" you can have subsequences of a,ab,abc,abcd,abcde,b,bc,bcd,bcde,c,cd,cde,d,de,e. Is that right? Then it's a straight function of length. Because a subsequence can start on any letter and run from between 1 character until the rest of the length long. In the previous example the sequence is 5 long. And there are 5 subsequences beginning with a. There are 4 subsequences starting with b. 3 with c and so on. Number of subsequences: 5 + 4 + 3 + 2 + 1 So the number of subsequences is the sum from 1 to length. 1 + 2 + 3 + . . . + lengthOfString = # of subsequences. -- I used to drive a Heisenbergmobile, but every time I looked at the speedometer, I got lost.	[reply]
Re^2: True Brute Force Longest Common Sub Sequence by Anonymous Monk on Nov 25, 2007 at 06:51 UTC
Subsequencea can also be ac, acd, ace, ade, adh, etc. all possible permutations of string S1, should be 2^n possible subsequences. How would I obtain this using Perl?	[reply]
Re^3: True Brute Force Longest Common Sub Sequence by roboticus (Chancellor) on Nov 25, 2007 at 12:58 UTC
You could try this: $ cat 652805.pl #!/usr/bin/perl -w # # Generate all possible subsequences of sequence 1 # use strict; use warnings; # Sequence 1 my @seq_1 = ('a', 'b', 'c', 'd', 'e'); # Count of all possible subsequences my $s1_max = 2 @seq_1; print "Number of possible subsequences: ", $s1_max,"\n"; # To get each possible subsequence, use a binary number # as a mask to tell you which digits to print for my $seq (0 .. $s1_max-1) { print "Sequence ", $seq, ": "; my $bit = 0; while ($bit < @seq_1) { print $seq_1[$bit], " " if 2$bit & $seq; ++$bit; } print "\n"; } [download] Not the most efficient solution, but I tried to make it easy to read for you... ...roboticus	[reply] [d/l]
Re^3: True Brute Force Longest Common Sub Sequence by CountZero (Bishop) on Nov 25, 2007 at 10:28 UTC
No, it cannot. "Sequence" means that you cannot skip characters in your string. Otherwise it would just be permutations and not sequences. See BrowserUK's comments hereafter. CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James	[reply]
Re^4: True Brute Force Longest Common Sub Sequence by BrowserUk (Patriarch) on Nov 25, 2007 at 11:04 UTC
Re^5: True Brute Force Longest Common Sub Sequence by CountZero (Bishop) on Nov 25, 2007 at 13:24 UTC
Re^3: True Brute Force Longest Common Sub Sequence by KurtSchwind (Chaplain) on Nov 25, 2007 at 13:13 UTC
Hrm. I wouldn't call that a sequence, but ok. You problem is still a function of length and nothing more. It's the sum of 2 ^ (remaining length), from 1 to length. Since we can have any character on or off (essentially). So you have something length 5, for example. That'd be `1 * 2^0 + 2 * 2 ^1 + 3 * 2^2 + 4 * 2 ^ 3 + 5 * 2 ^4`. Just looking at that sequence should give you a good idea of how to right a loop around it. -- I used to drive a Heisenbergmobile, but every time I looked at the speedometer, I got lost.	[reply] [d/l]
Re: True Brute Force Longest Common Sub Sequence by lima1 (Curate) on Nov 25, 2007 at 11:21 UTC
Sounds really like an XY-problem. What do you want to do? How many sequences do you have? How long are they? Here are a lot of bioinformatics people, maybe we know a better solution...	[reply]
Re: True Brute Force Longest Common Sub Sequence by Limbic~Region (Chancellor) on Nov 25, 2007 at 22:10 UTC
Anonymous Monk, I contributed a brute force alternative when I wrote Longest Common Subsequence as a comparison. Does it not do what you want? Cheers - L~R	[reply]