Certainly. I've included my test code at the bottom of this post. And I'll attempt a brief explanation of it.
The input to my test code is a file of preprocessed sentences, one per line, with all extraneous characters removed. It reads all the lines into memory, breaking them into arrays of words:
my @sentences = map { [ split ] } <>;
It then builds a parallel array (@uniq) of hashes of the unique words in each sentence:
my @uniq;
$uniq[ $_ ] = { map{ $_ => 1 } @{ $sentences[ $_ ] } } for 0 .. $#sentences;
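To make the data structures concrete, here is a minimal sketch using two invented sentences (not from any real input file):

use strict; use warnings;

## Illustration only: two hypothetical preprocessed input lines.
my @sentences = map { [ split ] } ( "the cat sat on the mat", "the dog sat on the rug" );
## $sentences[0] is [ 'the', 'cat', 'sat', 'on', 'the', 'mat' ]

my @uniq;
$uniq[ $_ ] = { map{ $_ => 1 } @{ $sentences[ $_ ] } } for 0 .. $#sentences;
## $uniq[0] is { the => 1, cat => 1, sat => 1, on => 1, mat => 1 }
## (the duplicate 'the' collapses to a single key), so word membership
## becomes a cheap hash lookup:
print "common\n" if exists $uniq[1]{ sat };    ## prints "common"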
This is done up front because each sentence, and its associated hash of unique words, is re-used many times during the run of the code.
The main body of the program consists of two nested loops over the indices of the sentences (and their associated uniq hashes), comparing each pair of sentences once in the usual way:
my( @aFragments, @bFragments );

for my $ai ( 0 .. $#sentences ) {
    my $sa = $sentences[ $ai ];
    my $ua = $uniq[ $ai ];

    for my $bi ( $ai+1 .. $#sentences ) {
        my $sb = $sentences[ $bi ];
        my $ub = $uniq[ $bi ];

        ## process sentence[ $ai ] against sentences[ $bi ]
    }
}
The variables $sa, $sb, $ua & $ub are references to the sentence arrays and uniq hashes for A & B respectively; they just simplify the expressions in the subsequent processing.
The arrays @aFragments & @bFragments are produced, for each pairing, by calling fragmentSentence() with the arrayref of one sentence and the uniq hashref of the other:
sub fragmentSentence {
    my( $sa, $ub ) = @_;

    return reduce{
        exists $ub->{ $b }
            ? push @{ $a->[ $#$a ] }, $b
            : push @{ $a }, [];
        $a;
    } [[]], @{ $sa };
}
It works by using reduce to build an array of arrays of contiguous words in sentence A that also appear in sentence B (and vice versa).
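For example, here is a minimal sketch of what it produces for a pair of made-up sentences (using List::Util's reduce for brevity here; the full listing below substitutes a pure-Perl version):

use strict; use warnings;
use List::Util qw[ reduce ];

sub fragmentSentence {
    my( $sa, $ub ) = @_;
    return reduce{
        exists $ub->{ $b }
            ? push @{ $a->[ $#$a ] }, $b
            : push @{ $a }, [];
        $a;
    } [[]], @{ $sa };
}

## Illustration only: invented data.
my $sa = [ qw( the cat sat on the mat ) ];                  ## sentence A as words
my $ub = { map{ $_ => 1 } qw( the dog sat on the rug ) };   ## uniq words of sentence B

my $frags = fragmentSentence( $sa, $ub );
print join( ' | ', map { "'@$_'" } @{ $frags } ), "\n";
## prints: 'the' | 'sat on the' | ''
## 'cat' and 'mat' do not occur in B, so they break A into runs; the empty and
## one-word runs are discarded later by the grep{ @$_ > 1 }.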
The central core of the processing then runs a standard LCS algorithm on each of the resultant fragments of sentence A against each of the fragments of sentence B, but takes the opportunity to exit early at several points when it is obvious that no common sequence longer than the best already seen can be found:
## if there are no common words between the sentences, exit early
next unless first{ exists $ua->{ $_ } } keys %{ $ub };

## fragment both sentences, discarding one-word fragments
## and sorting them by number of words, longest first.
@aFragments = sort{ @$b <=> @$a }
              grep{ @$_ > 1 }
              @{ fragmentSentence( $sa, $ub ) };

@bFragments = sort{ @$b <=> @$a }
              grep{ @$_ > 1 }
              @{ fragmentSentence( $sb, $ua ) };

## Exit early if either sentence has no remaining fragments
next unless @aFragments and @bFragments;

my @best;

## For each A fragment
for my $aFrag ( @aFragments ) {
    ## early exit if this fragment is shorter than the best so far
    next if @$aFrag <= @best;

    ## for each B fragment
    for my $bFrag ( @bFragments ) {
        ## Ditto early exit
        next if @$bFrag <= @best;

        ## Finally, perform the LCS algorithm
        my @lcws = lcws( $aFrag, $bFrag );

        ## And save it if it's the longest yet seen
        @best = @lcws if @lcws > @best;
    }
}

## Skip if we didn't find one
next unless @best;

## Output the sentence numbers and the best LCWS we found.
printf "( %4d / %4d )=>[ %s ]\n", $ai, $bi, join( ' ', @best );
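To illustrate the lcws() call (the sub itself is in the full listing below), this is what it returns for two invented fragments:

## Illustration only, using lcws() from the listing below.
my @lcws = lcws( [ qw( and sat on the mat all ) ], [ qw( dog sat on the rug ) ] );
print "@lcws\n";    ## prints "sat on the"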
Let me know if anything needs clarifying, and also whether it helps with your problem.
The code:
#! perl -slw
use strict;
#use List::Util qw[ reduce ];

our $MAX ||= 0;
$|++;

## Pure Perl versions of List::Util's reduce() and first() are included here
## because the XS versions leak like a sieve in this application.
sub reduce (&@) {
    my $code = shift;
    no strict 'refs';

    return shift unless @_ > 1;

    my $caller = caller;
    local( *{ $caller . "::a" } ) = \my $a;
    local( *{ $caller . "::b" } ) = \my $b;

    $a = shift;
    foreach (@_) {
        $b = $_;
        $a = &{$code}();
    }

    $a;
}

sub first (&@) {
    my $code = shift;
    my $x;
    $x = &$code() and return $x for @_;
}

sub fragmentSentence {
    my( $sa, $ub ) = @_;

    return reduce{
        exists $ub->{ $b }
            ? push @{ $a->[ $#$a ] }, $b
            : push @{ $a }, [];
        $a;
    } [[]], @{ $sa };
}

sub lcws {    ## longest common word sequence
    my( $a, $b ) = @_;
    ( $a, $b ) = ( $b, $a ) if @{ $a } > @{ $b };

    my $aString = "@{ $a }";

    my @best;
    for my $start ( 0 .. $#{ $b } - 1 ) {
        for my $length ( reverse 1 .. $#{ $b } - $start ) {
            last if $length < @best;
            if( 1 + index $aString, qq[ @{$b}[$start..$start+$length] ] ) {
                @best = @{ $b }[ $start .. $start + $length ];
            }
        }
    }
    return @best;
}

my @sentences = map { [ split ] } <>;

my @uniq;
$uniq[ $_ ] = { map{ $_ => 1 } @{ $sentences[ $_ ] } } for 0 .. $#sentences;

my( @aFragments, @bFragments );

for my $ai ( 0 .. ( $MAX || $#sentences ) ) {
    my $sa = $sentences[ $ai ];
    my $ua = $uniq[ $ai ];

    for my $bi ( $ai+1 .. ( $MAX || $#sentences ) ) {
        my $sb = $sentences[ $bi ];
        my $ub = $uniq[ $bi ];

        next unless first{ exists $ua->{ $_ } } keys %{ $ub };

        @aFragments = sort{ @$b <=> @$a }
                      grep{ @$_ > 1 }
                      @{ fragmentSentence( $sa, $ub ) };

        @bFragments = sort{ @$b <=> @$a }
                      grep{ @$_ > 1 }
                      @{ fragmentSentence( $sb, $ua ) };

        next unless @aFragments and @bFragments;

        my @best;
        for my $aFrag ( @aFragments ) {
            next if @$aFrag <= @best;

            for my $bFrag ( @bFragments ) {
                next if @$bFrag <= @best;

                my @lcws = lcws( $aFrag, $bFrag );
                @best = @lcws if @lcws > @best;
            }
        }

        next unless @best;

        printf "( %4d / %4d )=>[ %s ]\n", $ai, $bi, join( ' ', @best );
    }
}
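It can be run along these lines (lcws.pl and sentences.txt are just example names; the -MAX=nnn switch, handled by the -s on the shebang line, caps both loops at sentence index nnn, which is handy for quick trial runs):

perl lcws.pl sentences.txt
perl lcws.pl -MAX=100 sentences.txt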