Certainly. I've included my test code at the bottom of this post, and I'll attempt a brief explanation of it.
The input to my test code is a file of preprocessed sentences, one per line, with all extraneous characters removed. It reads all the lines into memory, breaking them into arrays of words:
my @sentences = map { [ split ] } <>;
It then builds a parallel array of hashes of the unique words in each sentence:
my @uniq;
$uniq[ $_ ] = { map{ $_ => 1 } @{ $sentences[ $_ ] } } for 0 .. $#sentences;
This is done up-front, as each sentence and its associated hash of uniqs is re-used many times during the run of the code.
The main body of the program consists of two nested loops over the indices of the sentences (and associated uniqs hashes), comparing each sentence against every other sentence in the usual way:
my( @aFragments, @bFragments );

for my $ai ( 0 .. $#sentences ) {
    my $sa = $sentences[ $ai ];
    my $ua = $uniq[ $ai ];

    for my $bi ( $ai+1 .. $#sentences ) {
        my $sb = $sentences[ $bi ];
        my $ub = $uniq[ $bi ];

        ## process sentences[ $ai ] against sentences[ $bi ]
    }
}
The variables $sa, $sb, $ua and $ub are references to the sentence arrays and uniq hashes for A and B respectively, and just simplify the expressions in the subsequent processing.
The variables @aFragments and @bFragments are produced by calling fragmentSentence() for each pairing, passing the arrayref of one sentence and the uniq hashref of the other:
use List::Util qw[ reduce first ];

sub fragmentSentence {
    my( $sa, $ub ) = @_;
    return reduce{
        exists $ub->{ $b }
            ? push @{ $a->[ $#$a ] }, $b
            : push @{ $a }, []
        ;
        $a;
    } [[]], @{ $sa };
}
It works by using reduce to build an array of arrays of the contiguous words in sentence A that also appear in sentence B, and vice versa.
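For example, here is a self-contained run of fragmentSentence() on a made-up pair of sentences (the sample words are mine; the pushes are parenthesised for clarity but behave the same):

```perl
use strict;
use warnings;
use List::Util qw[ reduce ];

sub fragmentSentence {
    my( $sa, $ub ) = @_;
    return reduce {
        exists $ub->{ $b }
            ? push( @{ $a->[ $#$a ] }, $b )
            : push( @{ $a }, [] );
        $a;
    } [[]], @{ $sa };
}

## Sentence A as an array of words; uniq hash for sentence B.
my $sa = [ qw[ the cat sat on the mat ] ];
my $ub = { map { $_ => 1 } qw[ a cat sat on a hat ] };

## Discard the empty fragments for display.
my @frags = grep { @$_ } @{ fragmentSentence( $sa, $ub ) };
print "[ @$_ ]\n" for @frags;    ## prints: [ cat sat on ]
```

'cat sat on' survives as one fragment because those words appear contiguously in A and all exist in B; 'the' and 'mat' each start a new (empty) fragment.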
The central core of the processing then runs a standard LCS algorithm on each fragment of sentence A against each fragment of sentence B, taking the opportunity for an early exit at several points where it is obvious that no common sequence longer than the best already seen can be found:
## if there are no common words between the sentences, exit early
next unless first{ exists $ua->{ $_ } } keys %{ $ub };

## fragment both sentences, discarding one-word fragments
## and sorting them by number of words, longest first.
@aFragments = sort{ @$b <=> @$a } grep{ @$_ > 1 }
    @{ fragmentSentence( $sa, $ub ) };
@bFragments = sort{ @$b <=> @$a } grep{ @$_ > 1 }
    @{ fragmentSentence( $sb, $ua ) };

## Exit early if either sentence has no remaining fragments
next unless @aFragments and @bFragments;
my @best;

## For each A fragment
for my $aFrag ( @aFragments ) {
    ## early exit if this fragment is shorter than the best so far
    next if @$aFrag <= @best;

    ## for each B fragment
    for my $bFrag ( @bFragments ) {
        ## Ditto early exit
        next if @$bFrag <= @best;

        ## Finally, perform the LCS algorithm
        my @lcws = lcws( $aFrag, $bFrag );

        ## And save it if it's the longest yet seen
        @best = @lcws if @lcws > @best;
    }
}

## Skip if we didn't find one
next unless @best;

## Output the sentence numbers and the best LCWS we found.
printf "( %4d / %4d )=>[ %s ]\n", $ai, $bi, join( ' ', @best );
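The lcws() routine itself is not among the fragments above. As a minimal sketch, this is my assumption of what it does (a standard dynamic-programming longest common contiguous word run between two word arrayrefs), not the original implementation:

```perl
use strict;
use warnings;

## Assumed stand-in for lcws(): longest common contiguous run of words
## between two word-array refs, via the usual O(n*m) DP table.
sub lcws {
    my( $aFrag, $bFrag ) = @_;
    my( $best, $bestEnd ) = ( 0, 0 );
    my @len;    ## $len[$i][$j] = length of common run ending at ($i,$j)
    for my $i ( 0 .. $#$aFrag ) {
        for my $j ( 0 .. $#$bFrag ) {
            next unless $aFrag->[ $i ] eq $bFrag->[ $j ];
            $len[ $i ][ $j ] = ( $i && $j ) ? ( $len[ $i-1 ][ $j-1 ] // 0 ) + 1 : 1;
            ( $best, $bestEnd ) = ( $len[ $i ][ $j ], $i ) if $len[ $i ][ $j ] > $best;
        }
    }
    ## Empty list when no common word at all ($best == 0).
    return @{ $aFrag }[ $bestEnd - $best + 1 .. $bestEnd ];
}

print join( ' ', lcws( [ qw[ a b c d ] ], [ qw[ x b c y ] ] ) ), "\n";    ## b c
```

Because the fragments passed in already contain only words common to both sentences, any LCS variant restricted to contiguous matches would serve here.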
Let me know if anything needs clarifying, and also, if it helps your problem.
The code:
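Putting the pieces together, here is a self-contained approximation assembled from the fragments above, with the sample input inlined instead of read from <> and with lcws() supplied as my own sketch rather than the original:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw[ reduce first ];

sub fragmentSentence {
    my( $sa, $ub ) = @_;
    return reduce {
        exists $ub->{ $b }
            ? push( @{ $a->[ $#$a ] }, $b )
            : push( @{ $a }, [] );
        $a;
    } [[]], @{ $sa };
}

## Assumed stand-in for lcws(): longest common contiguous word run.
sub lcws {
    my( $aFrag, $bFrag ) = @_;
    my( $best, $bestEnd ) = ( 0, 0 );
    my @len;
    for my $i ( 0 .. $#$aFrag ) {
        for my $j ( 0 .. $#$bFrag ) {
            next unless $aFrag->[ $i ] eq $bFrag->[ $j ];
            $len[ $i ][ $j ] = ( $i && $j ) ? ( $len[ $i-1 ][ $j-1 ] // 0 ) + 1 : 1;
            ( $best, $bestEnd ) = ( $len[ $i ][ $j ], $i ) if $len[ $i ][ $j ] > $best;
        }
    }
    return @{ $aFrag }[ $bestEnd - $best + 1 .. $bestEnd ];
}

## Inlined sample input; the original reads preprocessed sentences from <>.
my @sentences = map { [ split ] } (
    'the quick brown fox jumps over the lazy dog',
    'a quick brown fox leaps over a sleeping dog',
    'every lazy dog dreams of a quick brown fox',
);

my @uniq;
$uniq[ $_ ] = { map { $_ => 1 } @{ $sentences[ $_ ] } } for 0 .. $#sentences;

my( @aFragments, @bFragments );
for my $ai ( 0 .. $#sentences ) {
    my $sa = $sentences[ $ai ];
    my $ua = $uniq[ $ai ];
    for my $bi ( $ai+1 .. $#sentences ) {
        my $sb = $sentences[ $bi ];
        my $ub = $uniq[ $bi ];

        next unless first { exists $ua->{ $_ } } keys %{ $ub };

        @aFragments = sort { @$b <=> @$a } grep { @$_ > 1 }
            @{ fragmentSentence( $sa, $ub ) };
        @bFragments = sort { @$b <=> @$a } grep { @$_ > 1 }
            @{ fragmentSentence( $sb, $ua ) };
        next unless @aFragments and @bFragments;

        my @best;
        for my $aFrag ( @aFragments ) {
            next if @$aFrag <= @best;
            for my $bFrag ( @bFragments ) {
                next if @$bFrag <= @best;
                my @lcws = lcws( $aFrag, $bFrag );
                @best = @lcws if @lcws > @best;
            }
        }
        next unless @best;
        printf "( %4d / %4d )=>[ %s ]\n", $ai, $bi, join( ' ', @best );
    }
}
```

On the inlined sample this prints one line per sentence pair, e.g. `(    0 /    1 )=>[ quick brown fox ]` for the first pair.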