This brings up a question: what should an lcss() routine return?
- All the longest common substrings?
- If not, the 'first'? And first from the left-hand end of the string, from the right-hand end, or simply the first discovered by the most efficient algorithm?
- If the longest common substring appears multiple times at different offsets, should they all be returned, only the unique ones, or just the first found?
Of course, the only real answer is: it depends upon the application's requirements. It is easy to suggest that returning everything is the most flexible solution, since it allows the application to choose what it needs; but that doesn't help applications -- like the one I was addressing when I started this thread 7 years ago -- that only need one result; know that there will only be one; and are time-critical enough that they cannot afford to waste time searching for others that either they know won't exist, or don't care about even if they do.
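It's easy enough to imagine a front-end that defers the choice to the caller. Purely as illustration -- a hypothetical interface of my own, not code from this thread, assuming a back-end like the lcss_brute() below that returns an arrayref of [ substring, pos1, pos2 ] triples:

sub lcss {
    my( $r1, $r2, %opt ) = @_;
    my $want = $opt{want} // 'all';        # 'all' | 'first' | 'unique'
    my $all  = lcss_brute( $r1, $r2 );     # arrayref of [ substring, pos1, pos2 ]
    return $all      if $want eq 'all';
    return $all->[0] if $want eq 'first';  # first in discovery order
    if( $want eq 'unique' ) {
        my %seen;                          # keep one entry per distinct substring
        return [ grep { !$seen{ $_->[0] }++ } @$all ];
    }
    die "unknown policy: $want";
}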
Above you said "Un-optimized, but functionally correct code is better than broken code."; but that is only true if there is a definitive definition of "functionally correct". For example, the lcss_brute() function you've now uploaded is, by at least one definition, 'incorrect', since it doesn't look for duplicates.
That can be corrected by adding another loop:
sub lcss_brute {
    # $min is accepted for interface compatibility but unused here.
    my( $r1, $r2, $min ) = @_;
    my( $l1, $l2, $swap ) = ( length $$r1, length $$r2, 0 );

    # Ensure $$r1 is the shorter string; $swap records the exchange
    # so the caller can interpret the offsets correctly.
    ( $r1, $r2, $l1, $l2, $swap ) = ( $r2, $r1, $l2, $l1, 1 ) if $l1 > $l2;

    my( $best, @solns ) = 0;
    for my $start ( 0 .. $l1 - 1 ) {
        # Try the longest candidate at this start position first.
        for my $l ( reverse 1 .. $l1 - $start ) {
            my $substr = substr( $$r1, $start, $l );
            my $o = 0;
            # Find every (possibly overlapping) occurrence of the candidate
            # in the other string; index() returns -1 when exhausted, so
            # 1+index() becomes 0 and the loop terminates.
            while( $o = 1 + index( $$r2, $substr, $o ) ) {
                if( $l > $best ) {
                    # New best length: discard the shorter solutions.
                    $best  = $l;
                    @solns = [ $substr, $start, $o - 1 ];
                }
                elsif( $l == $best ) {
                    # Same length at a different offset: keep it too.
                    push @solns, [ $substr, $start, $o - 1 ];
                }
            }
        }
    }
    return \@solns;
}
Which means that for your latest example it now produces similar output to lcss_all():

[ 7:25:50.33] C:\test>lcss-test.pl
[["gh", 1, 7], ["gh", 1, 9], ["da", 3, 4]]
Though as you can see, because the algorithm operates differently from lcss_all(), it produces the matches in a different order. That could be cured by sorting by position, but then you've added yet another overhead for those applications that simply do not care.
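For a caller that does want positional order, the post-processing is a one-line sort -- a sketch, where $s1 and $s2 are my placeholder strings and the [ substring, pos1, pos2 ] layout is as returned by lcss_brute() above:

my @sorted = sort { $a->[1] <=> $b->[1] || $a->[2] <=> $b->[2] }
             @{ lcss_brute( \$s1, \$s2 ) };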
For example: I finally dug out the CD that contained the code I sent to my client, and I discovered why they have never encountered the false-hit problem you found in the OP code. In their application, the needles (shorter strings) are always newline-terminated, and the haystacks (longer strings) never contain embedded newlines, so that particular problem can never occur. For their purposes, my OP algorithm was, and still is, "functionally correct".
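Were anyone to want that implicit contract made explicit, a cheap guard could assert it up front -- hypothetical code; $needle and $haystack are my names, not the client's:

die "needle must be newline-terminated\n"
    unless $needle =~ m/\n\z/;
die "haystack must not contain embedded newlines\n"
    if $haystack =~ m/\n(?!\z)/;    # any newline not at the very end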
However, their primary concern was that they were processing huge volumes of data, and the lcss() algorithm they had been using was responsible for a high proportion of their processing time. The savings from my version of the algorithm, combined with some other savings I found, reduced their runtime by almost 50%. Slow and exhaustive would have been of no value to them at all.
On the CD I also found that I had coded up an Inline::C version of the OP algorithm that runs even faster, though it would (almost certainly) exhibit the same 'flaw' as the OP version. Thinking back, I recall they rejected it because they did not have the skills to maintain C code. Nonetheless, with the addition of the fix I gave above for the false-hits problem, it would probably satisfy the needs of many users with an lcss() requirement.
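I won't reproduce the client code here, but for anyone curious about the shape of such a thing, a minimal brute-force Inline::C sketch -- written for this post, emphatically not the version from the CD, and without its speed tricks -- might look like:

use strict;
use warnings;
use Inline C => <<'END_C';
/* Brute-force sketch only: returns the first longest common
   substring as a Perl string (the empty string if none).   */
SV * lcss_c( char *s1, char *s2 ) {
    STRLEN l1 = strlen( s1 ), l2 = strlen( s2 );
    STRLEN i, j, k, best = 0, besti = 0;
    for( i = 0; i < l1; ++i ) {
        for( j = 0; j < l2; ++j ) {
            /* Count the matching run starting at s1[i] / s2[j]. */
            k = 0;
            while( i + k < l1 && j + k < l2 && s1[ i + k ] == s2[ j + k ] )
                ++k;
            if( k > best ) { best = k; besti = i; }
        }
    }
    return newSVpvn( s1 + besti, best );
}
END_C

print lcss_c( 'xdeabghcd', 'abghxcdex' ), "\n";    # prints "abgh"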
So the problem is: where to go from here?