Re^2: Faster indexing an array

I need something similar as result as sdiff() of Algorithm::Diff provides.

$sequence1 = [ qw( a b     ) ];
$sequence2 = [ qw(    b  c ) ]; 

# the longest common subsequence of it
$LCS_index = [[ 1, 1 ]];

# aligned
$result = [
  [ 'a', '' ],
  [ 'b', 'b' ],
  [ '',   'c' ],
];

$stringified = [
  'ab_',
  '_bc',
];
[download]

The two most popular of the fastest algorithms for LCS are Hunt/McIllroy (used in Algorithm::Diff, Algorithm::LCS from BackPAN written in XS) and Meyers/Ukkonen (used in GNU-diff, String::Similarity).

What I implemented is an improved Hunt/McIllroy from AFROZA BEGUM, A GREEDY APPROACH FOR COMPUTING LONGEST COMMON SUBSEQUENCES, Journal of Prime Research in Mathematics Vol. 4(2008), 165-170. It beats A::D::sdiff(). To be fair A::D provides more functionality, which I also try to strip down for comparison. In the end I would like to modify the XS of A::LCS. A::LCS processes 0.8 Mio/s in comparison to 14 thousand/s A::D::sdiff() of length=10, edit-distance=4. But making the aligned hunks via perl from the LCS of A::LCS slows down to 35 thousand/s.

A::LCS-aligned         35714.29/s (n=50000)
lcs_greedy_aligned:  22831.05/s (n=50000)
A::D::sdiff:               14204.55/s (n=50000)

Here my code (dirty as it is work in progress):

sub lcs_greedy {
  my $self = shift;
  my $X = shift;
  my $Y = shift;
  
  my $YPos;
  my $index = 0;
  push @{ $YPos->{$_} },$index++ for @$Y;
  
  my $Xmatches;
  for ( $index = 0 ; $index <= $#$X ; $index++ ) {
    if ( exists( $YPos->{$X->[$index]} ) ) {
      push ( @$Xmatches , $index );
    }
  }
  
  my $Xcurrent = -1;
  my $Ycurrent = -1;  
  
  my $m = $#$Xmatches;
    
  my $n = $#$Y;
  
  my @L = (); # LCS
  my $R = 0;  # records the position of last selected symbol
  my $i = 0;
  
  my $Pi;
  my $Pi1;
  
  my $hunk;
   
for ($i = 0; $i <= $m; $i++) {
  $hunk = [];
  $Pi =  $YPos->{$X->[$Xmatches->[$i]]}->[0] // $n+1; # Position in Y 
+of ith symbol
  $Pi1 =  ($i < $m && defined $YPos->{$X->[$Xmatches->[$i+1]]}->[0]) 
      ? $YPos->{$X->[$Xmatches->[$i+1]]}->[0] : -1; # Position in Y of
+ i + 1st symbol
  #print STDERR '$i: ',$i,' $Pi: ',$Pi,' $Pi1: ',$Pi1,' $R: ',$R,"\n";
  while ($Pi1 < $R && $Pi1 > -1) { 
    #print STDERR '$Pi1 < $R',"\n";
    shift @{$YPos->{$X->[$Xmatches->[$i+1]]}};
    $Pi1 = $YPos->{$X->[$Xmatches->[$i+1]]}->[0] // -1;
  }
  while ($Pi < $R && $Pi < $n+1) {
    #print STDERR '$Pi < $R',"\n";
    shift @{$YPos->{$X->[$Xmatches->[$i]]}};
    $Pi =  $YPos->{$X->[$Xmatches->[$i]]}->[0] // $n+1;
  }
  if ($Pi > $Pi1 && $Pi1 > $R) {
    $hunk = [$Xmatches->[$i+1],$Pi1];
    shift @{$YPos->{$X->[$Xmatches->[$i+1]]}};
    $R = $Pi1;
    $i = $i+1;
  } 
  elsif ($Pi <  $n+1) {
    $hunk = [$Xmatches->[$i],$Pi];
    shift @{$YPos->{$X->[$Xmatches->[$i]]}}; 
    $R = $Pi;
  }
  if (scalar @$hunk) { 
    while ($Xcurrent+1 < $hunk->[0] ||  $Ycurrent+1 < $hunk->[1] ) {
      my $Xtemp = '';
      my $Ytemp = '';
      if ($Xcurrent+1 < $hunk->[0]) {
        #$Xtemp = $Xcurrent+1;
        $Xtemp = $X->[$Xcurrent+1];
        $Xcurrent++;
      }
      if ($Ycurrent+1 < $hunk->[1]) {
        #$Ytemp = $Ycurrent+1;
        $Ytemp = $Y->[$Ycurrent+1];
        $Ycurrent++;
      }
      push @L,[$Xtemp,$Ytemp];
    }
    $Xcurrent = $hunk->[0];
    $Ycurrent = $hunk->[1];
    #push @L,$hunk; # indices
    push @L,[$X->[$Xcurrent],$Y->[$Ycurrent]]; # elements
  }
}
  while ($Xcurrent+1 <= $#$X ||  $Ycurrent+1 <= $#$Y ) {
    my $Xtemp = '';
    my $Ytemp = '';
    if ($Xcurrent+1 <= $#$X) {
      #$Xtemp = $Xcurrent+1;
      $Xtemp = $X->[$Xcurrent+1];
      $Xcurrent++;
    }
    if ($Ycurrent+1 <= $#$Y) {
      #$Ytemp = $Ycurrent+1;
      $Ytemp = $Y->[$Ycurrent+1];
      $Ycurrent++;
    }
    push @L,[$Xtemp,$Ytemp];
  }  
return \@L;
}
[download]

Comment on Re^2: Faster indexing an array Select or Download Code


Perl Monk, Perl Meditation
	PerlMonks