What I'd really like is a way to generate tests automatically
I agree 100% that a testcase generator is the only sure way to test optimisations for this type of algorithm. However, I think I would approach that goal from a somewhat different angle.
Firstly, I believe that it is relatively easy to code (if not so easy to prove) an exhaustive (thus slow) algorithm that is guaranteed to inspect all possibilities; and by extension find all 'best' solutions.
Here is (what I believe to be) such an implementation:
sub lcss_brute {
my( $r1, $r2, $min ) = @_;
my( $l1, $l2, $swap ) = ( length $$r1, length $$r2, 0 );
( $r1, $r2, $l1, $l2, $swap ) = ( $r2, $r1, $l2, $l1, 1 ) if $l1 >
+ $l2;
my( $best, @solns ) = 0;
for my $start ( 0 .. $l2 - 1 ) {
for my $l ( reverse 1 .. $l1 - $start ) {
my $substr = substr( $$r1, $start, $l );
my $o = index( $$r2, $substr );
next if $o < 0;
if( $l > $best ) {
$best = length $substr;
@solns = [ $substr, $start, $o ];
}
elsif( $l == $best ) {
push @solns, [ $substr, $start, $o ];
}
}
}
return \@solns;
}
Supplied with the bug you found, it produces both valid solutions:: pp lcss_brute( \'xxxyyxxy', \'yyyxyxx', 1 );
C:\test>lcss-test.pl
[["yyx", 1, 3], ["yxx", 4, 4]]
That, if others agree it can neither miss solutions nor generate false ones, takes care of the "reference model".
Then comes the problem of ensuring you generate a set of tests that is guaranteed (or at least very likely) to explore all the possible failure modes for optimised implementations.
Initially, that seems like a 'how long is a piece of string' (substring:) problem; but thinking back to the little reading I've done on DFA/NFA theory, in particular the observation that for algorithmic proof purposes, single letters are often substituted ("without loss of generality") for unique substrings -- eg. 'aba' is equivalent to both 'the quick the' & 'onetwoone' etc. -- then I believe that it is possible to exercise all possible failure modes by iterating all the permutations of a relatively limited set of characters to produce the 'first string' inputs; and then iterating all the variations of all the substrings of that first string to produce the 'second string' inputs.
Please note that my use of the terms 'permutations' and 'variations' in the above is in their common understanding meanings rather than their specific meanings in the combinatorics sense. That is to say, I have yet worked out whether the above should use permutations() or combinations() or variations() or *_with_repetitions() for the two parts of the generation.
To this end, I've come up with this as a first pass at an exhaustive testcase generator: #! perl -slw
use strict;
use Data::Dump qw[ pp ]; $Data::Dump::WIDTH = 1000;
use Algorithm::Combinatorics qw[ variations_with_repetition permutatio
+ns ];
sub lcss_brute {
my( $r1, $r2, $min ) = @_;
my( $l1, $l2, $swap ) = ( length $$r1, length $$r2, 0 );
( $r1, $r2, $l1, $l2, $swap ) = ( $r2, $r1, $l2, $l1, 1 ) if $l1 >
+ $l2;
my( $best, @solns ) = 0;
for my $start ( 0 .. $l2 - 1 ) {
for my $l ( reverse 1 .. $l1 - $start ) {
my $substr = substr( $$r1, $start, $l );
my $o = index( $$r2, $substr );
next if $o < 0;
if( $l > $best ) {
$best = length $substr;
@solns = [ $substr, $start, $o ];
}
elsif( $l == $best ) {
push @solns, [ $substr, $start, $o ];
}
}
}
return \@solns;
}
my @chars = 'a' .. 'e';
my $iter1 = permutations( \@chars );
while( $_ = $iter1->next ) {
my $long = join '', @$_;
for my $l ( 2 .. $#chars ) {
my $iter = variations_with_repetition( $_, $l );
while( my $r = $iter->next ) {
my $short = join '', @$r;
my $solns = lcss_brute( \$long, \$short, 1 );
next unless defined( $solns );
printf "\rFrom '%s' in '%s'; solns:'%s'\t\t\t", $short, $l
+ong, pp $solns;
}
}
}
Running that against all the permutation of 'a'..'e' as the first input; and all the variations with repetition of all lengths of substring of the generated first strings, runs in around a minute. I'm not sure yet, but that may be enough?
I'm not convinced whether I should be using permutations or combinations; and maybe it could be just variations rather than variations with repetitions; and I'm not yet convinced whether 5 characters is enough; but may be others have some inputs at this point?
An interesting exercise would be to run this against the known failing implementations and see if it detects problems; and if those problems it detects can be logically reduced to be the same as the known problems.
Anyway, that's as far as my logic will allow me to go at this point; let me know what you (or anyone else still watching) think.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
In the absence of evidence, opinion is indistinguishable from prejudice.
|