Re: Re: Re: LCCS time complexity

in reply to Re: Re: LCCS time complexity
in thread LCCS time complexity

Well actually dominus wrote the module, bikenomad maintains it and merlyn brought it to my attention. Glad it works better. Might as well make it a module....

package String::LCSS::Fast;

use 5.006;
use strict;
use warnings;
use Algorithm::Diff qw(traverse_sequences);
require Exporter;
use vars qw( @ISA @EXPORT_OK $VERSION );
our @ISA = qw(Exporter);
@EXPORT_OK = qw( LCSS CSS CSS_Sorted );
$VERSION = '0.01';

sub _tokenize { [split //, $_[0]] }

sub CSS {
    my $is_array = ref $_[0] eq 'ARRAY' ? 1 : 0;
    my $sort = $_[2];
    my ( $seq1, $seq2, @match, $from_match );
    my $i = 0;
    if ( $is_array ) {
        $seq1 = $_[0];
        $seq2 = $_[1];
        traverse_sequences( $seq1, $seq2, {
            MATCH => sub { push @{$match[$i]}, $seq1->[$_[0]]; $from_m
+atch = 1 },
            DISCARD_A => sub { do{$i++; $from_match = 0} if $from_matc
+h },
            DISCARD_B => sub { do{$i++; $from_match = 0} if $from_matc
+h },
        });
    }
    else {
        $seq1 = _tokenize($_[0]);
        $seq2 = _tokenize($_[1]);
        traverse_sequences( $seq1, $seq2, {
            MATCH => sub { $match[$i] .= $seq1->[$_[0]]; $from_match =
+ 1 },
            DISCARD_A => sub { do{$i++; $from_match = 0} if $from_matc
+h },
            DISCARD_B => sub { do{$i++; $from_match = 0} if $from_matc
+h },
        });
    }
  return \@match;
}

sub CSS_Sorted {
    my $match = CSS(@_);
    if ( ref $_[0] eq 'ARRAY' ) {
       @$match = map{$_->[0]}sort{$b->[1]<=>$a->[1]}map{[$_,scalar(@$_
+)]}@$match
    }
    else {
       @$match = map{$_->[0]}sort{$b->[1]<=>$a->[1]}map{[$_,length($_)
+]}@$match
    }
  return $match;
}

sub LCSS {
    my $is_array = ref $_[0] eq 'ARRAY' ? 1 : 0;
    my $css = CSS(@_);
    my $index;
    my $length = 0;
    if ( $is_array ) {
        for( my $i = 0; $i < @$css; $i++ ) {
            next unless @{$css->[$i]}>$length;
            $index = $i;
            $length = @{$css->[$i]};
        }
    }
    else {
        for( my $i = 0; $i < @$css; $i++ ) {
            next unless length($css->[$i])>$length;
            $index = $i;
            $length = length($css->[$i]);
        }
    }
  return $css->[$index];
}

1;
__END__

=head1 NAME

String::LCSS::Fast - Perl extension for getting the Longest Common Sub
+String

=head1 SYNOPSIS

  use String::LCSS::Fast qw( LCSS CSS CSS_Sorted );

=head1 DESCRIPTION

This module uses Algoritm::Diff to implement LCSS and is orders of mag
+nitude
faster than String::LCSS.

=head1 METHODS

=head2 LCSS

Returns the longest common sub string. If there may be more than one a
+nd
it matters use CSS instead.

    my $lcss_ary_ref = LCSS( \@SEQ1, \@SEQ2 );  # ref to array
    my $lcss_string  = LCSS( $STR1, $STR2 );    # string

=head2 CSS

Returns all the common sub strings, unsorted.

    my $css_ary_ref = CSS( \@SEQ1, \@SEQ2 );  # ref to array of arrays
    my $css_str_ref = CSS( $STR1, $STR2 );    # ref to array of string
+s

=head2 CSS_Sorted

Returns all the common sub strings, sorted from longest to shortest

    my $css_ary_ref = CSS( \@SEQ1, \@SEQ2 );  # ref to array of arrays
    my $css_str_ref = CSS( $STR1, $STR2 );    # ref to array of string
+s

=head1 EXPORT

None by default.

=head1 AUTHOR

Dr James Freeman <james.freeman@id3.org.uk>

=head1 SEE ALSO

L<perl>.

=cut
[download]

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Comment on Re: Re: Re: LCCS time complexity Download Code

Replies are listed 'Best First'.
Re: TIMTOWTDI by rkg (Hermit) on Sep 10, 2003 at 10:52 UTC
Thanks, tachyon. For my application, I needed maximal runs of identical words between two strings. I'm loose about what comprises a word. For example, for my application "hi-res" is the same as "hires", and punctuation and case don't matter. In the spirit of TIMTOWTDI, here's what I scraped together: # FRAGMENT my $lcs = lcss(standardize($x1), standardize($x2); sub lcss { my ($str1, $str2 ) = @_; my @match = (); my @longest = (); my $i = 0; my $seq1 = [split(/\s+/, $str1)]; my $seq2 = [split (/\s+/, $str2)]; my $sub = sub { @longest = map {$_} @match if (@match >= @longest); @match = (); }; traverse_sequences( $seq1, $seq2, { MATCH => sub {push(@match, $seq1->[$_[0]]);}, DISCARD_A => $sub, DISCARD_B => $sub, }); my $lcs = join(' ', @longest); return $lcs; } # lowercase and remove odd characters sub standardize { my ($text) = @_; return unless $text; $text =~ s/\[.*?\]/ /g; $text =~ s/[.,?"':&()!-]/ /g; $text =~ s/[^\w ]//g; $text =~ s/^\s+//; $text =~ s/\s+$//; $text =~ s/\s+/ /; $text = lc $text; return $text; } [download] rkg	[reply] [d/l]
Re: Re: TIMTOWTDI by tachyon (Chancellor) on Sep 11, 2003 at 02:29 UTC
Hi, looking at your REs I have a couple of suggestions: `$text =~ s/\[.?\]/ /g; # you can lose the .? by using [^\]]* which save the RE engine ba +cktracking # $text =~ s/\[[^\]]*\]/ /g; $text =~ s/[.,?"':&()!-]/ /g; $text =~ s/[^\w ]//g; # you need a /m to make this match # here # and here # and here # ie this will only match the very begining of the string $text =~ s/^\s+//; # ditto $text =~ s/\s+$//; $text =~ s/\s+/ /;` [download] cheers tachyon s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print	[reply] [d/l]
Re: Re: Re: TIMTOWTDI by rkg (Hermit) on Sep 11, 2003 at 10:38 UTC
Thanks for the `$text =~ s/\[[^\]]\]/ /g;` tip; that is better. As for the `/m` on these `$text =~ s/^\s+//; $text =~ s/\s+$//;` [download] I think I'd put the `/m` on the third* RE `$text =~ s/\s+/ /m; #yes?` [download] so taken together the triple REs mean: "remove leading whitespace, trailing whitespace, and make all interior whitespace (even across line breaks) into single spaces". Many thanks for the OT-but-useful teaching! rkg	[reply] [d/l] [select]
Re: Re: Re: Re: TIMTOWTDI by tachyon (Chancellor) on Sep 11, 2003 at 10:54 UTC

In Section Seekers of Perl Wisdom