Re: Faster indexing an array

Replies are listed 'Best First'.
Re^2: Faster indexing an array by wollmers (Scribe) on Sep 19, 2014 at 23:42 UTC
I need something similar as result as sdiff() of Algorithm::Diff provides. `$sequence1 = [ qw( a b ) ]; $sequence2 = [ qw( b c ) ]; # the longest common subsequence of it $LCS_index = [[ 1, 1 ]]; # aligned $result = [ [ 'a', '' ], [ 'b', 'b' ], [ '', 'c' ], ]; $stringified = [ 'ab_', '_bc', ];` [download] The two most popular of the fastest algorithms for LCS are Hunt/McIllroy (used in Algorithm::Diff, Algorithm::LCS from BackPAN written in XS) and Meyers/Ukkonen (used in GNU-diff, String::Similarity). What I implemented is an improved Hunt/McIllroy from AFROZA BEGUM, A GREEDY APPROACH FOR COMPUTING LONGEST COMMON SUBSEQUENCES, Journal of Prime Research in Mathematics Vol. 4(2008), 165-170. It beats A::D::sdiff(). To be fair A::D provides more functionality, which I also try to strip down for comparison. In the end I would like to modify the XS of A::LCS. A::LCS processes 0.8 Mio/s in comparison to 14 thousand/s A::D::sdiff() of length=10, edit-distance=4. But making the aligned hunks via perl from the LCS of A::LCS slows down to 35 thousand/s. A::LCS-aligned 35714.29/s (n=50000) lcs_greedy_aligned: 22831.05/s (n=50000) A::D::sdiff: 14204.55/s (n=50000) Here my code (dirty as it is work in progress): sub lcs_greedy { my $self = shift; my $X = shift; my $Y = shift; my $YPos; my $index = 0; push @{ $YPos->{$_} },$index++ for @$Y; my $Xmatches; for ( $index = 0 ; $index <= $#$X ; $index++ ) { if ( exists( $YPos->{$X->[$index]} ) ) { push ( @$Xmatches , $index ); } } my $Xcurrent = -1; my $Ycurrent = -1; my $m = $#$Xmatches; my $n = $#$Y; my @L = (); # LCS my $R = 0; # records the position of last selected symbol my $i = 0; my $Pi; my $Pi1; my $hunk; for ($i = 0; $i <= $m; $i++) { $hunk = []; $Pi = $YPos->{$X->[$Xmatches->[$i]]}->[0] // $n+1; # Position in Y +of ith symbol $Pi1 = ($i < $m && defined $YPos->{$X->[$Xmatches->[$i+1]]}->[0]) ? $YPos->{$X->[$Xmatches->[$i+1]]}->[0] : -1; # Position in Y of + i + 1st symbol #print STDERR '$i: ',$i,' $Pi: ',$Pi,' $Pi1: ',$Pi1,' $R: ',$R,"\n"; while ($Pi1 < $R && $Pi1 > -1) { #print STDERR '$Pi1 < $R',"\n"; shift @{$YPos->{$X->[$Xmatches->[$i+1]]}}; $Pi1 = $YPos->{$X->[$Xmatches->[$i+1]]}->[0] // -1; } while ($Pi < $R && $Pi < $n+1) { #print STDERR '$Pi < $R',"\n"; shift @{$YPos->{$X->[$Xmatches->[$i]]}}; $Pi = $YPos->{$X->[$Xmatches->[$i]]}->[0] // $n+1; } if ($Pi > $Pi1 && $Pi1 > $R) { $hunk = [$Xmatches->[$i+1],$Pi1]; shift @{$YPos->{$X->[$Xmatches->[$i+1]]}}; $R = $Pi1; $i = $i+1; } elsif ($Pi < $n+1) { $hunk = [$Xmatches->[$i],$Pi]; shift @{$YPos->{$X->[$Xmatches->[$i]]}}; $R = $Pi; } if (scalar @$hunk) { while ($Xcurrent+1 < $hunk->[0] \|\| $Ycurrent+1 < $hunk->[1] ) { my $Xtemp = ''; my $Ytemp = ''; if ($Xcurrent+1 < $hunk->[0]) { #$Xtemp = $Xcurrent+1; $Xtemp = $X->[$Xcurrent+1]; $Xcurrent++; } if ($Ycurrent+1 < $hunk->[1]) { #$Ytemp = $Ycurrent+1; $Ytemp = $Y->[$Ycurrent+1]; $Ycurrent++; } push @L,[$Xtemp,$Ytemp]; } $Xcurrent = $hunk->[0]; $Ycurrent = $hunk->[1]; #push @L,$hunk; # indices push @L,[$X->[$Xcurrent],$Y->[$Ycurrent]]; # elements } } while ($Xcurrent+1 <= $#$X \|\| $Ycurrent+1 <= $#$Y ) { my $Xtemp = ''; my $Ytemp = ''; if ($Xcurrent+1 <= $#$X) { #$Xtemp = $Xcurrent+1; $Xtemp = $X->[$Xcurrent+1]; $Xcurrent++; } if ($Ycurrent+1 <= $#$Y) { #$Ytemp = $Ycurrent+1; $Ytemp = $Y->[$Ycurrent+1]; $Ycurrent++; } push @L,[$Xtemp,$Ytemp]; } return \@L; } [download]	[reply] [d/l] [select]

Replies are listed 'Best First'.

Re^2: Faster indexing an array
by wollmers (Scribe) on Sep 19, 2014 at 23:42 UTC

I need something similar as result as sdiff() of Algorithm::Diff provides.

$sequence1 = [ qw( a b     ) ];
$sequence2 = [ qw(    b  c ) ]; 

# the longest common subsequence of it
$LCS_index = [[ 1, 1 ]];

# aligned
$result = [
  [ 'a', '' ],
  [ 'b', 'b' ],
  [ '',   'c' ],
];

$stringified = [
  'ab_',
  '_bc',
];
[download]

The two most popular of the fastest algorithms for LCS are Hunt/McIllroy (used in Algorithm::Diff, Algorithm::LCS from BackPAN written in XS) and Meyers/Ukkonen (used in GNU-diff, String::Similarity).

What I implemented is an improved Hunt/McIllroy from AFROZA BEGUM, A GREEDY APPROACH FOR COMPUTING LONGEST COMMON SUBSEQUENCES, Journal of Prime Research in Mathematics Vol. 4(2008), 165-170. It beats A::D::sdiff(). To be fair A::D provides more functionality, which I also try to strip down for comparison. In the end I would like to modify the XS of A::LCS. A::LCS processes 0.8 Mio/s in comparison to 14 thousand/s A::D::sdiff() of length=10, edit-distance=4. But making the aligned hunks via perl from the LCS of A::LCS slows down to 35 thousand/s.

A::LCS-aligned         35714.29/s (n=50000)
lcs_greedy_aligned:  22831.05/s (n=50000)
A::D::sdiff:               14204.55/s (n=50000)

Here my code (dirty as it is work in progress):

sub lcs_greedy {
  my $self = shift;
  my $X = shift;
  my $Y = shift;
  
  my $YPos;
  my $index = 0;
  push @{ $YPos->{$_} },$index++ for @$Y;
  
  my $Xmatches;
  for ( $index = 0 ; $index <= $#$X ; $index++ ) {
    if ( exists( $YPos->{$X->[$index]} ) ) {
      push ( @$Xmatches , $index );
    }
  }
  
  my $Xcurrent = -1;
  my $Ycurrent = -1;  
  
  my $m = $#$Xmatches;
    
  my $n = $#$Y;
  
  my @L = (); # LCS
  my $R = 0;  # records the position of last selected symbol
  my $i = 0;
  
  my $Pi;
  my $Pi1;
  
  my $hunk;
   
for ($i = 0; $i <= $m; $i++) {
  $hunk = [];
  $Pi =  $YPos->{$X->[$Xmatches->[$i]]}->[0] // $n+1; # Position in Y 
+of ith symbol
  $Pi1 =  ($i < $m && defined $YPos->{$X->[$Xmatches->[$i+1]]}->[0]) 
      ? $YPos->{$X->[$Xmatches->[$i+1]]}->[0] : -1; # Position in Y of
+ i + 1st symbol
  #print STDERR '$i: ',$i,' $Pi: ',$Pi,' $Pi1: ',$Pi1,' $R: ',$R,"\n";
  while ($Pi1 < $R && $Pi1 > -1) { 
    #print STDERR '$Pi1 < $R',"\n";
    shift @{$YPos->{$X->[$Xmatches->[$i+1]]}};
    $Pi1 = $YPos->{$X->[$Xmatches->[$i+1]]}->[0] // -1;
  }
  while ($Pi < $R && $Pi < $n+1) {
    #print STDERR '$Pi < $R',"\n";
    shift @{$YPos->{$X->[$Xmatches->[$i]]}};
    $Pi =  $YPos->{$X->[$Xmatches->[$i]]}->[0] // $n+1;
  }
  if ($Pi > $Pi1 && $Pi1 > $R) {
    $hunk = [$Xmatches->[$i+1],$Pi1];
    shift @{$YPos->{$X->[$Xmatches->[$i+1]]}};
    $R = $Pi1;
    $i = $i+1;
  } 
  elsif ($Pi <  $n+1) {
    $hunk = [$Xmatches->[$i],$Pi];
    shift @{$YPos->{$X->[$Xmatches->[$i]]}}; 
    $R = $Pi;
  }
  if (scalar @$hunk) { 
    while ($Xcurrent+1 < $hunk->[0] ||  $Ycurrent+1 < $hunk->[1] ) {
      my $Xtemp = '';
      my $Ytemp = '';
      if ($Xcurrent+1 < $hunk->[0]) {
        #$Xtemp = $Xcurrent+1;
        $Xtemp = $X->[$Xcurrent+1];
        $Xcurrent++;
      }
      if ($Ycurrent+1 < $hunk->[1]) {
        #$Ytemp = $Ycurrent+1;
        $Ytemp = $Y->[$Ycurrent+1];
        $Ycurrent++;
      }
      push @L,[$Xtemp,$Ytemp];
    }
    $Xcurrent = $hunk->[0];
    $Ycurrent = $hunk->[1];
    #push @L,$hunk; # indices
    push @L,[$X->[$Xcurrent],$Y->[$Ycurrent]]; # elements
  }
}
  while ($Xcurrent+1 <= $#$X ||  $Ycurrent+1 <= $#$Y ) {
    my $Xtemp = '';
    my $Ytemp = '';
    if ($Xcurrent+1 <= $#$X) {
      #$Xtemp = $Xcurrent+1;
      $Xtemp = $X->[$Xcurrent+1];
      $Xcurrent++;
    }
    if ($Ycurrent+1 <= $#$Y) {
      #$Ytemp = $Ycurrent+1;
      $Ytemp = $Y->[$Ycurrent+1];
      $Ycurrent++;
    }
    push @L,[$Xtemp,$Ytemp];
  }  
return \@L;
}
[download]

[reply]
[d/l]
[select]