This brings up a question: what should an lcss() routine return?
- All the longest common substrings?
- If not, the 'first'? And first from the left-hand end of the string, from the right-hand end, or simply the first discovered by the most efficient algorithm?
- If the longest common substring appears multiple times at different offsets, should they all be returned, only the unique ones, or just the first found?
Of course, the only real answer is: it depends upon the application's requirements. It is easy to suggest that returning everything is the most flexible solution, since it allows the application to choose what it needs; but that doesn't help applications -- like the one I was addressing when I started this thread 7 years ago -- that only need one result; know that there will only be one; and are time-critical enough that they cannot afford to waste time searching for others that either they know won't exist, or don't care about even if they do.
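It's easy enough to imagine a front-end that defers the choice to the caller. Purely as illustration -- a hypothetical interface of my own, not code from this thread, assuming a back-end like the lcss_brute() below that returns an arrayref of [ substring, pos1, pos2 ] triples:

sub lcss {
    my( $r1, $r2, %opt ) = @_;
    my $want = $opt{want} // 'all';        # 'all' | 'first' | 'unique'
    my $all  = lcss_brute( $r1, $r2 );     # arrayref of [ substring, pos1, pos2 ]
    return $all      if $want eq 'all';
    return $all->[0] if $want eq 'first';  # first in discovery order
    if( $want eq 'unique' ) {
        my %seen;                          # keep one entry per distinct substring
        return [ grep { !$seen{ $_->[0] }++ } @$all ];
    }
    die "unknown policy: $want";
}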
Above you said "Un-optimized, but functionally correct code is better than broken code."; but that is only true if there is a definitive definition of "functionally correct". For example, the lcss_brute() function you've now uploaded is, by at least one definition, 'incorrect', since it doesn't look for duplicates.
That can be corrected by adding another loop:
sub lcss_brute {
    # $min is accepted for interface compatibility but unused here.
    my( $r1, $r2, $min ) = @_;
    my( $l1, $l2, $swap ) = ( length $$r1, length $$r2, 0 );

    # Ensure $$r1 is the shorter string; $swap records the exchange
    # so the caller can interpret the offsets correctly.
    ( $r1, $r2, $l1, $l2, $swap ) = ( $r2, $r1, $l2, $l1, 1 ) if $l1 > $l2;

    my( $best, @solns ) = 0;
    for my $start ( 0 .. $l1 - 1 ) {
        # Try the longest candidate at this start position first.
        for my $l ( reverse 1 .. $l1 - $start ) {
            my $substr = substr( $$r1, $start, $l );
            my $o = 0;
            # Find every (possibly overlapping) occurrence of the candidate
            # in the other string; index() returns -1 when exhausted, so
            # 1+index() becomes 0 and the loop terminates.
            while( $o = 1 + index( $$r2, $substr, $o ) ) {
                if( $l > $best ) {
                    # New best length: discard the shorter solutions.
                    $best  = $l;
                    @solns = [ $substr, $start, $o - 1 ];
                }
                elsif( $l == $best ) {
                    # Same length at a different offset: keep it too.
                    push @solns, [ $substr, $start, $o - 1 ];
                }
            }
        }
    }
    return \@solns;
}
Which means that for your latest example it now produces similar output to lcss_all():

[ 7:25:50.33] C:\test>lcss-test.pl
[["gh", 1, 7], ["gh", 1, 9], ["da", 3, 4]]
Though as you can see, because the algorithm operates differently from lcss_all(), it produces the matches in a different order. That could be cured by sorting by position, but then you've added yet another overhead for those applications that simply do not care.
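For a caller that does want positional order, the post-processing is a one-line sort -- a sketch, where $s1 and $s2 are my placeholder strings and the [ substring, pos1, pos2 ] layout is as returned by lcss_brute() above:

my @sorted = sort { $a->[1] <=> $b->[1] || $a->[2] <=> $b->[2] }
             @{ lcss_brute( \$s1, \$s2 ) };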
For example: I finally dug out the CD that contained the code I sent to my client, and I discovered why they have never encountered the false-hit problem you found in the OP code. In their application, the needles (shorter strings) are always newline-terminated, and the haystacks (longer strings) never contain embedded newlines, so that particular problem can never occur. For their purposes, my OP algorithm was, and still is, "functionally correct".
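Were anyone to want that implicit contract made explicit, a cheap guard could assert it up front -- hypothetical code; $needle and $haystack are my names, not the client's:

die "needle must be newline-terminated\n"
    unless $needle =~ m/\n\z/;
die "haystack must not contain embedded newlines\n"
    if $haystack =~ m/\n(?!\z)/;    # any newline not at the very end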
However, their primary concern was that they were processing huge volumes of data, and the lcss() algorithm they had been using was responsible for a high proportion of their processing time. The savings from my version of the algorithm, combined with some other savings I found, reduced their runtime by almost 50%. Slow and exhaustive would have been of no value to them at all.
On the CD I also found that I had coded up an Inline::C version of the OP algorithm that runs even faster, though it would (almost certainly) exhibit the same 'flaw' as the OP version. Thinking back, I recall they rejected it because they did not have the skills to maintain C code. Nonetheless, with the addition of the fix I gave above for the false-hits problem, it would probably satisfy the needs of many users with an lcss() requirement.
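I won't reproduce the client code here, but for anyone curious about the shape of such a thing, a minimal brute-force Inline::C sketch -- written for this post, emphatically not the version from the CD, and without its speed tricks -- might look like:

use strict;
use warnings;
use Inline C => <<'END_C';
/* Brute-force sketch only: returns the first longest common
   substring as a Perl string (the empty string if none).   */
SV * lcss_c( char *s1, char *s2 ) {
    STRLEN l1 = strlen( s1 ), l2 = strlen( s2 );
    STRLEN i, j, k, best = 0, besti = 0;
    for( i = 0; i < l1; ++i ) {
        for( j = 0; j < l2; ++j ) {
            /* Count the matching run starting at s1[i] / s2[j]. */
            k = 0;
            while( i + k < l1 && j + k < l2 && s1[ i + k ] == s2[ j + k ] )
                ++k;
            if( k > best ) { best = k; besti = i; }
        }
    }
    return newSVpvn( s1 + besti, best );
}
END_C

print lcss_c( 'xdeabghcd', 'abghxcdex' ), "\n";    # prints "abgh"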
So the problem is: where to go from here?