There's more than one way to do things PerlMonks

### Re: LCCS time complexity

by tachyon (Chancellor)
 on Sep 09, 2003 at 14:24 UTC ( #290044=note: print w/replies, xml ) Need Help??

in reply to LCCS time complexity

Maybe Algotithm::Diff will perform better. It isnt C but always seems to perform its job well and fast. If you are trying to do this on gene sequences you probably want to do it in C - once you have found an efficient algoritm. It does LCS native and can be easily made to do LCSS. Yes LCS and LCSS are different. The lcss method returns all the common substrings, unsorted. Just iterate through the list and remeber the index of the longest (much more efficient than a sort). You would want a Schwartzian transform on the sort by length if you want to use the multiple results (but if you want more than one LCSS its kinda a contradiction in terms).

NB: Edge case where two (L)CSS are same length second ignored in my simple loop.

```use Algorithm::Diff qw(LCS traverse_sequences);

my @seq1 = split //, 'abcdefghijklmnopqrstuvwxyz';
my @seq2 = split //,'flubberabcdubberdofghijklm';
my \$lcs    = join '', LCS( \@seq1, \@seq2 );
print "LCS: \$lcs\n";

my \$lcss = lcss( \@seq1, \@seq2 );
print "All CSS: @\$lcss\n";
my \$index;
my \$length = 0;
for( my \$i = 0; \$i < @\$lcss; \$i++ ) {
next unless length(\$lcss->[\$i])>\$length;
\$index = \$i;
\$length = length(\$lcss->[\$i]);
}
print "LCSS: ", \$lcss->[\$index], "\n";
# optional sort method
print "LCSS: ", (sort{length{\$a}<=>length{\$b}}@\$lcss)[-1];

sub lcss {
my ( \$seq1, \$seq2 ) = @_;
my ( @match, \$from_match );
my \$i = 0;
traverse_sequences( \$seq1, \$seq2, {
MATCH => sub { \$match[\$i] .= \$seq1->[\$_[0]]; \$from_match = 1 }
+,
DISCARD_A => sub { do{\$i++; \$from_match = 0} if \$from_match },
DISCARD_B => sub { do{\$i++; \$from_match = 0} if \$from_match },
});
return \@match;
}

__DATA__
LCS: abcdefghijklm
All CSS: abcd e fghijklm
LCSS: fghijklm
LCSS: fghijklm

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"\$'\$`\$\"\$\&"&ee&&y&srve&&d&&print

Replies are listed 'Best First'.
Re: Re: LCCS time complexity
by rkg (Hermit) on Sep 09, 2003 at 16:03 UTC
This code runs so much faster my head spins. Wow. Thanks. rkg

Well actually dominus wrote the module, bikenomad maintains it and merlyn brought it to my attention. Glad it works better. Might as well make it a module....

```package String::LCSS::Fast;

use 5.006;
use strict;
use warnings;
use Algorithm::Diff qw(traverse_sequences);
require Exporter;
use vars qw( @ISA @EXPORT_OK \$VERSION );
our @ISA = qw(Exporter);
@EXPORT_OK = qw( LCSS CSS CSS_Sorted );
\$VERSION = '0.01';

sub _tokenize { [split //, \$_[0]] }

sub CSS {
my \$is_array = ref \$_[0] eq 'ARRAY' ? 1 : 0;
my \$sort = \$_[2];
my ( \$seq1, \$seq2, @match, \$from_match );
my \$i = 0;
if ( \$is_array ) {
\$seq1 = \$_[0];
\$seq2 = \$_[1];
traverse_sequences( \$seq1, \$seq2, {
MATCH => sub { push @{\$match[\$i]}, \$seq1->[\$_[0]]; \$from_m
+atch = 1 },
DISCARD_A => sub { do{\$i++; \$from_match = 0} if \$from_matc
+h },
DISCARD_B => sub { do{\$i++; \$from_match = 0} if \$from_matc
+h },
});
}
else {
\$seq1 = _tokenize(\$_[0]);
\$seq2 = _tokenize(\$_[1]);
traverse_sequences( \$seq1, \$seq2, {
MATCH => sub { \$match[\$i] .= \$seq1->[\$_[0]]; \$from_match =
+ 1 },
DISCARD_A => sub { do{\$i++; \$from_match = 0} if \$from_matc
+h },
DISCARD_B => sub { do{\$i++; \$from_match = 0} if \$from_matc
+h },
});
}
return \@match;
}

sub CSS_Sorted {
my \$match = CSS(@_);
if ( ref \$_[0] eq 'ARRAY' ) {
@\$match = map{\$_->[0]}sort{\$b->[1]<=>\$a->[1]}map{[\$_,scalar(@\$_
+)]}@\$match
}
else {
@\$match = map{\$_->[0]}sort{\$b->[1]<=>\$a->[1]}map{[\$_,length(\$_)
+]}@\$match
}
return \$match;
}

sub LCSS {
my \$is_array = ref \$_[0] eq 'ARRAY' ? 1 : 0;
my \$css = CSS(@_);
my \$index;
my \$length = 0;
if ( \$is_array ) {
for( my \$i = 0; \$i < @\$css; \$i++ ) {
next unless @{\$css->[\$i]}>\$length;
\$index = \$i;
\$length = @{\$css->[\$i]};
}
}
else {
for( my \$i = 0; \$i < @\$css; \$i++ ) {
next unless length(\$css->[\$i])>\$length;
\$index = \$i;
\$length = length(\$css->[\$i]);
}
}
return \$css->[\$index];
}

1;
__END__

String::LCSS::Fast - Perl extension for getting the Longest Common Sub
+String

use String::LCSS::Fast qw( LCSS CSS CSS_Sorted );

This module uses Algoritm::Diff to implement LCSS and is orders of mag
+nitude
faster than String::LCSS.

Returns the longest common sub string. If there may be more than one a
+nd

my \$lcss_ary_ref = LCSS( \@SEQ1, \@SEQ2 );  # ref to array
my \$lcss_string  = LCSS( \$STR1, \$STR2 );    # string

Returns all the common sub strings, unsorted.

my \$css_ary_ref = CSS( \@SEQ1, \@SEQ2 );  # ref to array of arrays
my \$css_str_ref = CSS( \$STR1, \$STR2 );    # ref to array of string
+s

Returns all the common sub strings, sorted from longest to shortest

my \$css_ary_ref = CSS( \@SEQ1, \@SEQ2 );  # ref to array of arrays
my \$css_str_ref = CSS( \$STR1, \$STR2 );    # ref to array of string
+s

None by default.

Dr James Freeman <james.freeman@id3.org.uk>

L<perl>.

=cut

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"\$'\$`\$\"\$\&"&ee&&y&srve&&d&&print

Thanks, tachyon.

For my application, I needed maximal runs of identical words between two strings. I'm loose about what comprises a word. For example, for my application "hi-res" is the same as "hires", and punctuation and case don't matter. In the spirit of TIMTOWTDI, here's what I scraped together:

```# FRAGMENT

my \$lcs = lcss(standardize(\$x1), standardize(\$x2);

sub lcss {
my (\$str1, \$str2 ) = @_;
my @match = ();
my @longest = ();
my \$i = 0;
my \$seq1 = [split(/\s+/, \$str1)];
my \$seq2 = [split (/\s+/, \$str2)];
my \$sub = sub {
@longest = map {\$_} @match if (@match >= @longest);
@match = ();
};
traverse_sequences( \$seq1, \$seq2, {
MATCH => sub {push(@match, \$seq1->[\$_[0]]);},
});
my \$lcs = join(' ', @longest);
return \$lcs;
}

# lowercase and remove odd characters
sub standardize {
my (\$text) = @_;
return unless \$text;
\$text =~ s/\[.*?\]/ /g;
\$text =~ s/[.,?"':&()!-]/ /g;
\$text =~ s/[^\w ]//g;
\$text =~ s/^\s+//;
\$text =~ s/\s+\$//;
\$text =~ s/\s+/ /;
\$text = lc \$text;
return \$text;
}
rkg

Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://290044]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others rifling through the Monastery: (4)
As of 2023-02-07 11:58 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?