in reply to Faster indexing an array

> takes ~15% of an fast alignment algorithm

you should show us more of the whole algorithm.

I'm pretty sure our solutions for this isolated part can't be done faster in pure Perl.

But probably you just need another general tactic.

Cheers Rolf

(addicted to the Perl Programming Language and ☆☆☆☆ :)

Replies are listed 'Best First'.
Re^2: Faster indexing an array
by wollmers (Scribe) on Sep 19, 2014 at 23:42 UTC

    I need something similar as result as sdiff() of Algorithm::Diff provides.

    $sequence1 = [ qw( a b ) ]; $sequence2 = [ qw( b c ) ]; # the longest common subsequence of it $LCS_index = [[ 1, 1 ]]; # aligned $result = [ [ 'a', '' ], [ 'b', 'b' ], [ '', 'c' ], ]; $stringified = [ 'ab_', '_bc', ];

    The two most popular of the fastest algorithms for LCS are Hunt/McIllroy (used in Algorithm::Diff, Algorithm::LCS from BackPAN written in XS) and Meyers/Ukkonen (used in GNU-diff, String::Similarity).

    What I implemented is an improved Hunt/McIllroy from AFROZA BEGUM, A GREEDY APPROACH FOR COMPUTING LONGEST COMMON SUBSEQUENCES, Journal of Prime Research in Mathematics Vol. 4(2008), 165-170. It beats A::D::sdiff(). To be fair A::D provides more functionality, which I also try to strip down for comparison. In the end I would like to modify the XS of A::LCS. A::LCS processes 0.8 Mio/s in comparison to 14 thousand/s A::D::sdiff() of length=10, edit-distance=4. But making the aligned hunks via perl from the LCS of A::LCS slows down to 35 thousand/s.

    A::LCS-aligned         35714.29/s (n=50000)
    lcs_greedy_aligned:  22831.05/s (n=50000)
    A::D::sdiff:               14204.55/s (n=50000)
    

    Here my code (dirty as it is work in progress):

    sub lcs_greedy { my $self = shift; my $X = shift; my $Y = shift; my $YPos; my $index = 0; push @{ $YPos->{$_} },$index++ for @$Y; my $Xmatches; for ( $index = 0 ; $index <= $#$X ; $index++ ) { if ( exists( $YPos->{$X->[$index]} ) ) { push ( @$Xmatches , $index ); } } my $Xcurrent = -1; my $Ycurrent = -1; my $m = $#$Xmatches; my $n = $#$Y; my @L = (); # LCS my $R = 0; # records the position of last selected symbol my $i = 0; my $Pi; my $Pi1; my $hunk; for ($i = 0; $i <= $m; $i++) { $hunk = []; $Pi = $YPos->{$X->[$Xmatches->[$i]]}->[0] // $n+1; # Position in Y +of ith symbol $Pi1 = ($i < $m && defined $YPos->{$X->[$Xmatches->[$i+1]]}->[0]) ? $YPos->{$X->[$Xmatches->[$i+1]]}->[0] : -1; # Position in Y of + i + 1st symbol #print STDERR '$i: ',$i,' $Pi: ',$Pi,' $Pi1: ',$Pi1,' $R: ',$R,"\n"; while ($Pi1 < $R && $Pi1 > -1) { #print STDERR '$Pi1 < $R',"\n"; shift @{$YPos->{$X->[$Xmatches->[$i+1]]}}; $Pi1 = $YPos->{$X->[$Xmatches->[$i+1]]}->[0] // -1; } while ($Pi < $R && $Pi < $n+1) { #print STDERR '$Pi < $R',"\n"; shift @{$YPos->{$X->[$Xmatches->[$i]]}}; $Pi = $YPos->{$X->[$Xmatches->[$i]]}->[0] // $n+1; } if ($Pi > $Pi1 && $Pi1 > $R) { $hunk = [$Xmatches->[$i+1],$Pi1]; shift @{$YPos->{$X->[$Xmatches->[$i+1]]}}; $R = $Pi1; $i = $i+1; } elsif ($Pi < $n+1) { $hunk = [$Xmatches->[$i],$Pi]; shift @{$YPos->{$X->[$Xmatches->[$i]]}}; $R = $Pi; } if (scalar @$hunk) { while ($Xcurrent+1 < $hunk->[0] || $Ycurrent+1 < $hunk->[1] ) { my $Xtemp = ''; my $Ytemp = ''; if ($Xcurrent+1 < $hunk->[0]) { #$Xtemp = $Xcurrent+1; $Xtemp = $X->[$Xcurrent+1]; $Xcurrent++; } if ($Ycurrent+1 < $hunk->[1]) { #$Ytemp = $Ycurrent+1; $Ytemp = $Y->[$Ycurrent+1]; $Ycurrent++; } push @L,[$Xtemp,$Ytemp]; } $Xcurrent = $hunk->[0]; $Ycurrent = $hunk->[1]; #push @L,$hunk; # indices push @L,[$X->[$Xcurrent],$Y->[$Ycurrent]]; # elements } } while ($Xcurrent+1 <= $#$X || $Ycurrent+1 <= $#$Y ) { my $Xtemp = ''; my $Ytemp = ''; if ($Xcurrent+1 <= $#$X) { #$Xtemp = $Xcurrent+1; $Xtemp = $X->[$Xcurrent+1]; $Xcurrent++; } if ($Ycurrent+1 <= $#$Y) { #$Ytemp = $Ycurrent+1; $Ytemp = $Y->[$Ycurrent+1]; $Ycurrent++; } push @L,[$Xtemp,$Ytemp]; } return \@L; }