Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,

I have been searching far and wide for a true brute force code or algorithm to find the longest common subsequence between two strings for my research project and have not been so lucky. I have seen a few code here in Perlmonks but they are not truely brute force so it will make my measurements inaccurate, so I am turning to Perl monks for some wisdoms. I read an article on the web and it said that for two strings S1 and S2, assuming S1 is the longer string, I need to enumerate all possible subsequences in S1. Let's assume S1="abcde" and S2="abe" then to enumerate all possible subsequences of S1, what do I need to do? My first guess would be to use the split function to split S1 into individual character but then unsure of what to do next. Any helps would be appreciated.
Thanks in advance.

Also, I found the following algorithm on the web and it claims to be brute force but I am having a hard time understand what it actually do to implement it in Perl so would someone please explain it.
lcs is longest common subsequence algorithm lcs algorithm in c brute-force approach / input: array a of x/ / input: array b of x/ / output: pointer to beginning of longest sequence/ x * c[]; int a_counter; int b_counter; int c_counter; exists bool; int longest_sequence; int longest_sequence_index; / generate array of unique elements from a called c/ for (a_counter = a_start; a_counter < size_of_a; a_counter++) { c_counter = c_start; while ( c_counter < size_of_c ) { if (c[c_counter] == a[a_counter] ) { exists = true; break; } c_counter++; } if (exists) continue; else c[a_counter++] = a[a_counter]; exists = false; } / generate max_c array to hold max entries of c array/ /* (could also implement d as linked list)*/ int max_c[size_of_c]; for ( c_counter = c_start; c_counter < size_of_c; c_counter++) { max_c[c_counter] = 0; } / compare a[i] with ending elements of each c[0 ... size_of_c].final/ / elements/ for (b_counter = b_start+1; b_counter <= size_of_b; b_counter++) { for (c_counter = c_start; c_counter < size_of_c; c_counter++) +{ if (c[c_counter][max_c[c_counter]] == b[b_counter-1]) +{ c[c_counter][max_c[c_counter]] = b[b_counter]; max_c[c_counter]++; } } } /* find longest sequence */ int longest_sequence; for (c_counter = c_start; c_counter < size_of_c; c_counter++) { if (longest_sequence < max_c[c_counter]) { longest_sequence = max_c[c_counter]; longest_sequence_index = c_counter; } } return c[longest_sequence_index];

Replies are listed 'Best First'.
Re: True Brute Force Longest Common Sub Sequence
by ikegami (Patriarch) on Nov 25, 2007 at 09:18 UTC

    Brute force would involve the following steps:
    1. Find all 2n subsequences of s1.
    2. Find all 2n subsequences of s2.
    3. Find all common subsequences of s1 and s2.
    4. Find the longest of the common subsequences.

    use strict; use warnings; sub subseqs { my @subseqs = (''); while ($_[0] =~ /(.)/sg) { push @subseqs, map "$_$1", @subseqs; } return \@subseqs; } sub common { my ($list1, $list2) = @_; my %common; for my $i1 (0..$#$list1) { for my $i2 (0..$#$list2) { if ($list1->[$i1] eq $list2->[$i2]) { $common{$list1->[$i1]} = 1; } }} return keys %common; } sub longest { my $longest = shift(@_); for (@_) { if (length($_) > length($longest)) { $longest = $_; } } return $longest; } { my $s1 = 'abdef'; my $s2 = 'abcef'; my $longest = longest(common(subseqs($s1), subseqs($s2))); print("$longest\n"); }

    Tested.

    Update: Added optimized common as an alternative.
    Update: Removed optimized common. It actually reduced the complexity as mentioned below.

      If you just want to use brute force then using a hash is really cheating as it is already the begining of optimisation. We need nasty embedded loops to go O(2^n*2^m). I think something like this represents to worst way to do it, first generating all possible subsequences and then testing every case. This does however seem to be the point.

      use strict; sub subseqs { my $str = $_[0]; my @seqs = (''); for my $char (split //, $str) { for my $i ( 0..@seqs-1 ) { push @seqs, $seqs[$i].$char; } } return @seqs; } my $s1 = 'abcdefgh'; my $s2 = 'abecdgh'; my $best_len = 0; my $longest; for my $seq1 (subseqs($s1)) { for my $seq2 (subseqs($s2)) { next unless $seq1 eq $seq2; if (length($seq1) > $best_len) { $longest = $seq1; $best_len = length($seq1); } } } print $longest;

      Note that there is an edge case when 2 or more common subsequences share the same length only the first will be found.

        If you just want to use brute force then using a hash is really cheating as it is already the begining of optimisation.

        Have another look. That claim just isn't true.

        common still visits every single combination of inputs, so it's brute force. Said nasty embedded loops *are there* and the hash doesn't affect the complexity.

        common uses a hash to remove dups, something that isn't even necessary. It actually slows things down, if anything.

        Even with the optimized common, said nasty embedded loops *are there* overall and the using the optimized common doesn't affect the overall complexity.

        Your analysis doesn't take into account $seqs[$i] . $char and $seq1 eq $seq2. They are both based on N and/or M. $seqs[$i] . $char won't affect the final O() because it's higher up in the loop, but $seq1 eq $seq2 is in the innermost loop.

        for my $seq1 (subseqs($s1)) { for my $seq2 (subseqs($s2)) {

        could be written as

        my @s1 = subseqs($s1); my @s2 = subseqs($s2); for my $seq1 (@s1) { for my $seq2 (@s2) {

        for faster code of the same complexity.

Re: True Brute Force Longest Common Sub Sequence
by tachyon-II (Chaplain) on Nov 25, 2007 at 07:41 UTC

    Here is something basic in Perl to get you started. It is unclear if you mean substrings of subsequences. This deals with substrings. To understand it you will need to read up on what the substr(), push and sort functions do.

    my $s1 = 'abcdefghijklmnopqrstuvwxyzabcdefgab'; my $s2 = 'abcdefgab'; my $MIN_LENGTH = 2; my @subseq = (); for my $i (0..length($s2)-$MIN_LENGTH) { for my $j ($MIN_LENGTH..length($s2)-$i) { push @subseq, substr($s2,$i,$j); } } print "$_\n" for @subseq; @subseq = sort{length($b)<=>length($a)}@subseq; SUBSEQ: for my $subseq (@subseq) { for my $i (0..length($s1)-length($subseq)) { if ($subseq eq substr($s1,$i,length($subseq))) { print "Match $subseq @ $i\n"; #last SUBSEQ } } }
      Thanks for the code. Why do $MIN_LENGTH = 2 instead of 1 ?
Re: True Brute Force Longest Common Sub Sequence
by KurtSchwind (Chaplain) on Nov 25, 2007 at 06:28 UTC

    If I understand your question correctly, you want to know how many subsequences are in a given sequence? So for a S1="abcde" you can have subsequences of a,ab,abc,abcd,abcde,b,bc,bcd,bcde,c,cd,cde,d,de,e. Is that right?

    Then it's a straight function of length. Because a subsequence can start on any letter and run from between 1 character until the rest of the length long.

    In the previous example the sequence is 5 long. And there are 5 subsequences beginning with a. There are 4 subsequences starting with b. 3 with c and so on. Number of subsequences: 5 + 4 + 3 + 2 + 1

    So the number of subsequences is the sum from 1 to length. 1 + 2 + 3 + . . . + lengthOfString = # of subsequences.

    --
    I used to drive a Heisenbergmobile, but every time I looked at the speedometer, I got lost.
      Subsequencea can also be ac, acd, ace, ade, adh, etc. all possible permutations of string S1, should be 2^n possible subsequences. How would I obtain this using Perl?
        You could try this:

        $ cat 652805.pl #!/usr/bin/perl -w # # Generate all possible subsequences of sequence 1 # use strict; use warnings; # Sequence 1 my @seq_1 = ('a', 'b', 'c', 'd', 'e'); # Count of all possible subsequences my $s1_max = 2 ** @seq_1; print "Number of possible subsequences: ", $s1_max,"\n"; # To get each possible subsequence, use a binary number # as a mask to tell you which digits to print for my $seq (0 .. $s1_max-1) { print "Sequence ", $seq, ": "; my $bit = 0; while ($bit < @seq_1) { print $seq_1[$bit], " " if 2**$bit & $seq; ++$bit; } print "\n"; }
        Not the most efficient solution, but I tried to make it easy to read for you...

        ...roboticus

        No, it cannot.

        "Sequence" means that you cannot skip characters in your string. Otherwise it would just be permutations and not sequences.

        See BrowserUK's comments hereafter.

        CountZero

        A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

        Hrm. I wouldn't call that a sequence, but ok. You problem is still a function of length and nothing more.

        It's the sum of 2 ^ (remaining length), from 1 to length. Since we can have any character on or off (essentially).

        So you have something length 5, for example. That'd be 1 * 2^0 + 2 * 2 ^1 + 3 * 2^2 + 4 * 2 ^ 3 + 5 * 2 ^4. Just looking at that sequence should give you a good idea of how to right a loop around it.

        --
        I used to drive a Heisenbergmobile, but every time I looked at the speedometer, I got lost.
Re: True Brute Force Longest Common Sub Sequence
by lima1 (Curate) on Nov 25, 2007 at 11:21 UTC
    Sounds really like an XY-problem. What do you want to do? How many sequences do you have? How long are they? Here are a lot of bioinformatics people, maybe we know a better solution...
Re: True Brute Force Longest Common Sub Sequence
by Limbic~Region (Chancellor) on Nov 25, 2007 at 22:10 UTC