Certainly. I've included my test code at the bottom of this post, and I'll attempt a brief explanation of it.
The input to my test code is a file of preprocessed sentences, one per line, with all extraneous characters removed. It reads all the lines into memory, breaking them into arrays of words:
my @sentences = map { [ split ] } <>;
It then builds a parallel array of hashes of the unique words in each sentence:
my @uniq;
$uniq[ $_ ] = { map{ $_ => 1 } @{ $sentences[ $_ ] } } for 0 .. $#sentences;
This is done up-front, as each sentence and its associated hash of uniqs is re-used many times during the run of the code.
The main body of the program consists of two nested loops over the indices of the sentences (and associated uniqs hashes), comparing each sentence against every other sentence in the usual way:
my( @aFragments, @bFragments );

for my $ai ( 0 .. $#sentences ) {
    my $sa = $sentences[ $ai ];
    my $ua = $uniq[ $ai ];

    for my $bi ( $ai+1 .. $#sentences ) {
        my $sb = $sentences[ $bi ];
        my $ub = $uniq[ $bi ];

        ## process sentences[ $ai ] against sentences[ $bi ]
    }
}
The variables $sa, $sb, $ua and $ub are references to the sentence arrays and uniq hashes for A and B respectively, and just simplify the expressions in the subsequent processing.
The variables @aFragments and @bFragments are produced by calling fragmentSentence() for each pairing, passing the arrayref of one sentence and the uniq hashref of the other:
use List::Util qw[ reduce first ];

sub fragmentSentence {
    my( $sa, $ub ) = @_;
    return reduce{
        exists $ub->{ $b }
            ? push @{ $a->[ $#$a ] }, $b
            : push @{ $a }, []
        ;
        $a;
    } [[]], @{ $sa };
}
It works by using reduce to build an array of arrays of the contiguous words in sentence A that also appear in sentence B, and vice versa.
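For example, here is a self-contained run of fragmentSentence() on a made-up pair of sentences (the sample words are mine; the pushes are parenthesised for clarity but behave the same):

```perl
use strict;
use warnings;
use List::Util qw[ reduce ];

sub fragmentSentence {
    my( $sa, $ub ) = @_;
    return reduce {
        exists $ub->{ $b }
            ? push( @{ $a->[ $#$a ] }, $b )
            : push( @{ $a }, [] );
        $a;
    } [[]], @{ $sa };
}

## Sentence A as an array of words; uniq hash for sentence B.
my $sa = [ qw[ the cat sat on the mat ] ];
my $ub = { map { $_ => 1 } qw[ a cat sat on a hat ] };

## Discard the empty fragments for display.
my @frags = grep { @$_ } @{ fragmentSentence( $sa, $ub ) };
print "[ @$_ ]\n" for @frags;    ## prints: [ cat sat on ]
```

'cat sat on' survives as one fragment because those words appear contiguously in A and all exist in B; 'the' and 'mat' each start a new (empty) fragment.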
The central core of the processing then runs a standard LCS algorithm on each fragment of sentence A against each fragment of sentence B, taking the opportunity for an early exit at several points where it is obvious that no common sequence longer than the best already seen can be found:
## if there are no common words between the sentences, exit early
next unless first{ exists $ua->{ $_ } } keys %{ $ub };

## fragment both sentences, discarding one-word fragments
## and sorting them by number of words, longest first.
@aFragments = sort{ @$b <=> @$a } grep{ @$_ > 1 }
    @{ fragmentSentence( $sa, $ub ) };
@bFragments = sort{ @$b <=> @$a } grep{ @$_ > 1 }
    @{ fragmentSentence( $sb, $ua ) };

## Exit early if either sentence has no remaining fragments
next unless @aFragments and @bFragments;
my @best;

## For each A fragment
for my $aFrag ( @aFragments ) {
    ## early exit if this fragment is shorter than the best so far
    next if @$aFrag <= @best;

    ## for each B fragment
    for my $bFrag ( @bFragments ) {
        ## Ditto early exit
        next if @$bFrag <= @best;

        ## Finally, perform the LCS algorithm
        my @lcws = lcws( $aFrag, $bFrag );

        ## And save it if it's the longest yet seen
        @best = @lcws if @lcws > @best;
    }
}

## Skip if we didn't find one
next unless @best;

## Output the sentence numbers and the best LCWS we found.
printf "( %4d / %4d )=>[ %s ]\n", $ai, $bi, join( ' ', @best );
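The lcws() routine itself is not among the fragments above. As a minimal sketch, this is my assumption of what it does (a standard dynamic-programming longest common contiguous word run between two word arrayrefs), not the original implementation:

```perl
use strict;
use warnings;

## Assumed stand-in for lcws(): longest common contiguous run of words
## between two word-array refs, via the usual O(n*m) DP table.
sub lcws {
    my( $aFrag, $bFrag ) = @_;
    my( $best, $bestEnd ) = ( 0, 0 );
    my @len;    ## $len[$i][$j] = length of common run ending at ($i,$j)
    for my $i ( 0 .. $#$aFrag ) {
        for my $j ( 0 .. $#$bFrag ) {
            next unless $aFrag->[ $i ] eq $bFrag->[ $j ];
            $len[ $i ][ $j ] = ( $i && $j ) ? ( $len[ $i-1 ][ $j-1 ] // 0 ) + 1 : 1;
            ( $best, $bestEnd ) = ( $len[ $i ][ $j ], $i ) if $len[ $i ][ $j ] > $best;
        }
    }
    ## Empty list when no common word at all ($best == 0).
    return @{ $aFrag }[ $bestEnd - $best + 1 .. $bestEnd ];
}

print join( ' ', lcws( [ qw[ a b c d ] ], [ qw[ x b c y ] ] ) ), "\n";    ## b c
```

Because the fragments passed in already contain only words common to both sentences, any LCS variant restricted to contiguous matches would serve here.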
Let me know if anything needs clarifying, and also, if it helps your problem.
The code:
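Putting the pieces together, here is a self-contained approximation assembled from the fragments above, with the sample input inlined instead of read from <> and with lcws() supplied as my own sketch rather than the original:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw[ reduce first ];

sub fragmentSentence {
    my( $sa, $ub ) = @_;
    return reduce {
        exists $ub->{ $b }
            ? push( @{ $a->[ $#$a ] }, $b )
            : push( @{ $a }, [] );
        $a;
    } [[]], @{ $sa };
}

## Assumed stand-in for lcws(): longest common contiguous word run.
sub lcws {
    my( $aFrag, $bFrag ) = @_;
    my( $best, $bestEnd ) = ( 0, 0 );
    my @len;
    for my $i ( 0 .. $#$aFrag ) {
        for my $j ( 0 .. $#$bFrag ) {
            next unless $aFrag->[ $i ] eq $bFrag->[ $j ];
            $len[ $i ][ $j ] = ( $i && $j ) ? ( $len[ $i-1 ][ $j-1 ] // 0 ) + 1 : 1;
            ( $best, $bestEnd ) = ( $len[ $i ][ $j ], $i ) if $len[ $i ][ $j ] > $best;
        }
    }
    return @{ $aFrag }[ $bestEnd - $best + 1 .. $bestEnd ];
}

## Inlined sample input; the original reads preprocessed sentences from <>.
my @sentences = map { [ split ] } (
    'the quick brown fox jumps over the lazy dog',
    'a quick brown fox leaps over a sleeping dog',
    'every lazy dog dreams of a quick brown fox',
);

my @uniq;
$uniq[ $_ ] = { map { $_ => 1 } @{ $sentences[ $_ ] } } for 0 .. $#sentences;

my( @aFragments, @bFragments );
for my $ai ( 0 .. $#sentences ) {
    my $sa = $sentences[ $ai ];
    my $ua = $uniq[ $ai ];
    for my $bi ( $ai+1 .. $#sentences ) {
        my $sb = $sentences[ $bi ];
        my $ub = $uniq[ $bi ];

        next unless first { exists $ua->{ $_ } } keys %{ $ub };

        @aFragments = sort { @$b <=> @$a } grep { @$_ > 1 }
            @{ fragmentSentence( $sa, $ub ) };
        @bFragments = sort { @$b <=> @$a } grep { @$_ > 1 }
            @{ fragmentSentence( $sb, $ua ) };
        next unless @aFragments and @bFragments;

        my @best;
        for my $aFrag ( @aFragments ) {
            next if @$aFrag <= @best;
            for my $bFrag ( @bFragments ) {
                next if @$bFrag <= @best;
                my @lcws = lcws( $aFrag, $bFrag );
                @best = @lcws if @lcws > @best;
            }
        }
        next unless @best;
        printf "( %4d / %4d )=>[ %s ]\n", $ai, $bi, join( ' ', @best );
    }
}
```

On the inlined sample this prints one line per sentence pair, e.g. `(    0 /    1 )=>[ quick brown fox ]` for the first pair.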