I need something similar as result as sdiff() of Algorithm::Diff provides.
$sequence1 = [ qw( a b ) ];
$sequence2 = [ qw( b c ) ];
# the longest common subsequence of it
$LCS_index = [[ 1, 1 ]];
# aligned
$result = [
[ 'a', '' ],
[ 'b', 'b' ],
[ '', 'c' ],
];
$stringified = [
'ab_',
'_bc',
];
The two most popular of the fastest algorithms for LCS are Hunt/McIllroy (used in Algorithm::Diff, Algorithm::LCS from BackPAN written in XS) and Meyers/Ukkonen (used in GNU-diff, String::Similarity).
What I implemented is an improved Hunt/McIllroy from
AFROZA BEGUM, A GREEDY APPROACH FOR COMPUTING LONGEST COMMON SUBSEQUENCES, Journal of Prime Research in Mathematics Vol. 4(2008), 165-170. It beats A::D::sdiff(). To be fair A::D provides more functionality, which I also try to strip down for comparison. In the end I would like to modify the XS of A::LCS. A::LCS processes 0.8 Mio/s in comparison to 14 thousand/s A::D::sdiff() of length=10, edit-distance=4. But making the aligned hunks via perl from the LCS of A::LCS slows down to 35 thousand/s.
A::LCS-aligned 35714.29/s (n=50000)
lcs_greedy_aligned: 22831.05/s (n=50000)
A::D::sdiff: 14204.55/s (n=50000)
Here my code (dirty as it is work in progress):
sub lcs_greedy {
my $self = shift;
my $X = shift;
my $Y = shift;
my $YPos;
my $index = 0;
push @{ $YPos->{$_} },$index++ for @$Y;
my $Xmatches;
for ( $index = 0 ; $index <= $#$X ; $index++ ) {
if ( exists( $YPos->{$X->[$index]} ) ) {
push ( @$Xmatches , $index );
}
}
my $Xcurrent = -1;
my $Ycurrent = -1;
my $m = $#$Xmatches;
my $n = $#$Y;
my @L = (); # LCS
my $R = 0; # records the position of last selected symbol
my $i = 0;
my $Pi;
my $Pi1;
my $hunk;
for ($i = 0; $i <= $m; $i++) {
$hunk = [];
$Pi = $YPos->{$X->[$Xmatches->[$i]]}->[0] // $n+1; # Position in Y
+of ith symbol
$Pi1 = ($i < $m && defined $YPos->{$X->[$Xmatches->[$i+1]]}->[0])
? $YPos->{$X->[$Xmatches->[$i+1]]}->[0] : -1; # Position in Y of
+ i + 1st symbol
#print STDERR '$i: ',$i,' $Pi: ',$Pi,' $Pi1: ',$Pi1,' $R: ',$R,"\n";
while ($Pi1 < $R && $Pi1 > -1) {
#print STDERR '$Pi1 < $R',"\n";
shift @{$YPos->{$X->[$Xmatches->[$i+1]]}};
$Pi1 = $YPos->{$X->[$Xmatches->[$i+1]]}->[0] // -1;
}
while ($Pi < $R && $Pi < $n+1) {
#print STDERR '$Pi < $R',"\n";
shift @{$YPos->{$X->[$Xmatches->[$i]]}};
$Pi = $YPos->{$X->[$Xmatches->[$i]]}->[0] // $n+1;
}
if ($Pi > $Pi1 && $Pi1 > $R) {
$hunk = [$Xmatches->[$i+1],$Pi1];
shift @{$YPos->{$X->[$Xmatches->[$i+1]]}};
$R = $Pi1;
$i = $i+1;
}
elsif ($Pi < $n+1) {
$hunk = [$Xmatches->[$i],$Pi];
shift @{$YPos->{$X->[$Xmatches->[$i]]}};
$R = $Pi;
}
if (scalar @$hunk) {
while ($Xcurrent+1 < $hunk->[0] || $Ycurrent+1 < $hunk->[1] ) {
my $Xtemp = '';
my $Ytemp = '';
if ($Xcurrent+1 < $hunk->[0]) {
#$Xtemp = $Xcurrent+1;
$Xtemp = $X->[$Xcurrent+1];
$Xcurrent++;
}
if ($Ycurrent+1 < $hunk->[1]) {
#$Ytemp = $Ycurrent+1;
$Ytemp = $Y->[$Ycurrent+1];
$Ycurrent++;
}
push @L,[$Xtemp,$Ytemp];
}
$Xcurrent = $hunk->[0];
$Ycurrent = $hunk->[1];
#push @L,$hunk; # indices
push @L,[$X->[$Xcurrent],$Y->[$Ycurrent]]; # elements
}
}
while ($Xcurrent+1 <= $#$X || $Ycurrent+1 <= $#$Y ) {
my $Xtemp = '';
my $Ytemp = '';
if ($Xcurrent+1 <= $#$X) {
#$Xtemp = $Xcurrent+1;
$Xtemp = $X->[$Xcurrent+1];
$Xcurrent++;
}
if ($Ycurrent+1 <= $#$Y) {
#$Ytemp = $Ycurrent+1;
$Ytemp = $Y->[$Ycurrent+1];
$Ycurrent++;
}
push @L,[$Xtemp,$Ytemp];
}
return \@L;
}