Re: True Brute Force Longest Common Sub Sequence
by ikegami (Patriarch) on Nov 25, 2007 at 09:18 UTC
|
Brute force would involve the following steps:
1. Find all 2n subsequences of s1.
2. Find all 2n subsequences of s2.
3. Find all common subsequences of s1 and s2.
4. Find the longest of the common subsequences.
use strict;
use warnings;
sub subseqs {
my @subseqs = ('');
while ($_[0] =~ /(.)/sg) {
push @subseqs, map "$_$1", @subseqs;
}
return \@subseqs;
}
sub common {
my ($list1, $list2) = @_;
my %common;
for my $i1 (0..$#$list1) {
for my $i2 (0..$#$list2) {
if ($list1->[$i1] eq $list2->[$i2]) {
$common{$list1->[$i1]} = 1;
}
}}
return keys %common;
}
sub longest {
my $longest = shift(@_);
for (@_) {
if (length($_) > length($longest)) {
$longest = $_;
}
}
return $longest;
}
{
my $s1 = 'abdef';
my $s2 = 'abcef';
my $longest = longest(common(subseqs($s1), subseqs($s2)));
print("$longest\n");
}
Tested.
Update: Added optimized common as an alternative.
Update: Removed optimized common. It actually reduced the complexity as mentioned below.
| [reply] [d/l] [select] |
|
|
If you just want to use brute force then using a hash is really cheating as it is already the begining of optimisation. We need nasty embedded loops to go O(2^n*2^m). I think something like this represents to worst way to do it, first generating all possible subsequences and then testing every case. This does however seem to be the point.
use strict;
sub subseqs {
my $str = $_[0];
my @seqs = ('');
for my $char (split //, $str) {
for my $i ( 0..@seqs-1 ) {
push @seqs, $seqs[$i].$char;
}
}
return @seqs;
}
my $s1 = 'abcdefgh';
my $s2 = 'abecdgh';
my $best_len = 0;
my $longest;
for my $seq1 (subseqs($s1)) {
for my $seq2 (subseqs($s2)) {
next unless $seq1 eq $seq2;
if (length($seq1) > $best_len) {
$longest = $seq1;
$best_len = length($seq1);
}
}
}
print $longest;
Note that there is an edge case when 2 or more common subsequences share the same length only the first will be found. | [reply] [d/l] |
|
|
If you just want to use brute force then using a hash is really cheating as it is already the begining of optimisation.
Have another look. That claim just isn't true.
common still visits every single combination of inputs, so it's brute force. Said nasty embedded loops *are there* and the hash doesn't affect the complexity.
common uses a hash to remove dups, something that isn't even necessary. It actually slows things down, if anything.
Even with the optimized common, said nasty embedded loops *are there* overall and the using the optimized common doesn't affect the overall complexity.
| [reply] [d/l] [select] |
|
|
|
|
|
|
|
for my $seq1 (subseqs($s1)) {
for my $seq2 (subseqs($s2)) {
could be written as
my @s1 = subseqs($s1);
my @s2 = subseqs($s2);
for my $seq1 (@s1) {
for my $seq2 (@s2) {
for faster code of the same complexity.
| [reply] [d/l] [select] |
Re: True Brute Force Longest Common Sub Sequence
by tachyon-II (Chaplain) on Nov 25, 2007 at 07:41 UTC
|
Here is something basic in Perl to get you started. It is unclear if you mean substrings of subsequences. This deals with substrings. To understand it you will need to read up on what the substr(), push and sort functions do.
my $s1 = 'abcdefghijklmnopqrstuvwxyzabcdefgab';
my $s2 = 'abcdefgab';
my $MIN_LENGTH = 2;
my @subseq = ();
for my $i (0..length($s2)-$MIN_LENGTH) {
for my $j ($MIN_LENGTH..length($s2)-$i) {
push @subseq, substr($s2,$i,$j);
}
}
print "$_\n" for @subseq;
@subseq = sort{length($b)<=>length($a)}@subseq;
SUBSEQ:
for my $subseq (@subseq) {
for my $i (0..length($s1)-length($subseq)) {
if ($subseq eq substr($s1,$i,length($subseq))) {
print "Match $subseq @ $i\n";
#last SUBSEQ
}
}
}
| [reply] [d/l] |
|
|
Thanks for the code. Why do $MIN_LENGTH = 2 instead of 1 ?
| [reply] |
Re: True Brute Force Longest Common Sub Sequence
by KurtSchwind (Chaplain) on Nov 25, 2007 at 06:28 UTC
|
If I understand your question correctly, you want to know how many subsequences are in a given sequence? So for a S1="abcde" you can have subsequences of a,ab,abc,abcd,abcde,b,bc,bcd,bcde,c,cd,cde,d,de,e. Is that right?
Then it's a straight function of length. Because a subsequence can start on any letter and run from between 1 character until the rest of the length long.
In the previous example the sequence is 5 long. And there are 5 subsequences beginning with a. There are 4 subsequences starting with b. 3 with c and so on. Number of subsequences: 5 + 4 + 3 + 2 + 1
So the number of subsequences is the sum from 1 to length. 1 + 2 + 3 + . . . + lengthOfString = # of subsequences.
--
I used to drive a Heisenbergmobile, but every time I looked at the speedometer, I got lost.
| [reply] |
|
|
Subsequencea can also be ac, acd, ace, ade, adh, etc. all possible permutations of string S1, should be 2^n possible subsequences. How would I obtain this using Perl?
| [reply] |
|
|
$ cat 652805.pl
#!/usr/bin/perl -w
#
# Generate all possible subsequences of sequence 1
#
use strict;
use warnings;
# Sequence 1
my @seq_1 = ('a', 'b', 'c', 'd', 'e');
# Count of all possible subsequences
my $s1_max = 2 ** @seq_1;
print "Number of possible subsequences: ", $s1_max,"\n";
# To get each possible subsequence, use a binary number
# as a mask to tell you which digits to print
for my $seq (0 .. $s1_max-1) {
print "Sequence ", $seq, ": ";
my $bit = 0;
while ($bit < @seq_1) {
print $seq_1[$bit], " " if 2**$bit & $seq;
++$bit;
}
print "\n";
}
Not the most efficient solution, but I tried to make it easy to read for you...
...roboticus
| [reply] [d/l] |
|
|
| [reply] |
|
|
|
|
|
|
Hrm. I wouldn't call that a sequence, but ok. You problem is still a function of length and nothing more.
It's the sum of 2 ^ (remaining length), from 1 to length. Since we can have any character on or off (essentially).
So you have something length 5, for example. That'd be 1 * 2^0 + 2 * 2 ^1 + 3 * 2^2 + 4 * 2 ^ 3 + 5 * 2 ^4. Just looking at that sequence should give you a good idea of how to right a loop around it.
--
I used to drive a Heisenbergmobile, but every time I looked at the speedometer, I got lost.
| [reply] [d/l] |
Re: True Brute Force Longest Common Sub Sequence
by lima1 (Curate) on Nov 25, 2007 at 11:21 UTC
|
Sounds really like an XY-problem. What do you want to do? How many sequences do you have? How long are they? Here are a lot of bioinformatics people, maybe we know a better solution... | [reply] |
Re: True Brute Force Longest Common Sub Sequence
by Limbic~Region (Chancellor) on Nov 25, 2007 at 22:10 UTC
|
| [reply] |