Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery
 
PerlMonks  

Re^5: A better implementation of LCSS? (Do you have any combinatorics expertise to bring to bear?)

by toolic (Bishop)
on Nov 15, 2015 at 14:08 UTC ( [id://1147726]=note: print w/replies, xml ) Need Help??


in reply to Re^4: A better implementation of LCSS? (Do you have any combinatorics expertise to bring to bear?)
in thread A better implementation of LCSS?

I used your test generator to confirm that your brute-force solution matches both String::LCSS_XS and the wiki code.

I created another generator which represents more of a pounding fists on a keyboard approach -- random strings of random lengths as input. The fact that String::LCSS_XS::lcss_all returns a different number of longest common substrings from your brute code is giving me an ice cream headache. So, I won't be of much use in any further theoretical analysis. Here is an example pair of input strings that shows what I'm talking about:

gghdagahkk akakdadghgh BrowserUk lcss_brute: $VAR1 = [ 'gh', 'da' ]; String::LCSS_XS::lcss_all: $VAR1 = [ 'da', 'gh', 'gh' ];

After I uniq and sort the results, the 2 models match (I have checked this on millions of input string pairs).

In any case, I think I'll revise the patch I uploaded for String::LCSS to use your new code. Un-optimized, but functionally correct code is better than broken code. (UPDATE: I uploaded a new patch)

  • Comment on Re^5: A better implementation of LCSS? (Do you have any combinatorics expertise to bring to bear?)
  • Download Code

Replies are listed 'Best First'.
Re^6: A better implementation of LCSS? (Do you have any combinatorics expertise to bring to bear?)
by BrowserUk (Patriarch) on Nov 16, 2015 at 16:20 UTC

    Does String::LCSS_XS::lcss_all() return the offsets?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Yes:
      @arr = lcss_all('gghdagahkk', 'akakdadghgh'); use Data::Dumper; print Dumper(\@arr); __END__ $VAR1 = [ [ 'da', 3, 4 ], [ 'gh', 1, 7 ], [ 'gh', 1, 9 ] ];

        This brings up a question: What should an lcss() routine return?

        1. All the longest common substrings?

          If not:

        2. The 'first'.

          From the left-hand end of the string; or right-hand end; or the first discovered by the most efficient algorithm?

        3. If the longest common substring appears multiple times at different offsets;

          Should they

        4. All be returned;
        5. Or only the unique ones?
        6. The first found?

        Of course, the only real answer is: it depends upon the application requirements. It's easy to suggest that all is the most flexible solution as it allows the application to choose what it needs; but that doesn't help for applications -- like the one I was addressing when I started this thread 7 years ago -- where the application only needs one; knows that there will only be one; and is time critical enough that it cannot afford to waste time searching for others that either it knows won't exist; or it doesn't care about even if they do.

        Above you said "Un-optimized, but functionally correct code is better than broken code."; but that's only true if there is a definitive definition of "functionally correct". For example; the lcss_brute() function you've now uploaded; is at least by one definition, 'incorrect'; as it doesn't look for duplicates.

        That can be corrected by adding another loop:

        sub lcss_brute { my( $r1, $r2, $min ) = @_; my( $l1, $l2, $swap ) = ( length $$r1, length $$r2, 0 ); ( $r1, $r2, $l1, $l2, $swap ) = ( $r2, $r1, $l2, $l1, 1 ) if $l1 > + $l2; my( $best, @solns ) = 0; for my $start ( 0 .. $l2 - 1 ) { for my $l ( reverse 1 .. $l1 - $start ) { my $substr = substr( $$r1, $start, $l ); my $o = 0; while( $o = 1+index( $$r2, $substr, $o ) ) { if( $l > $best ) { $best = length $substr; @solns = [ $substr, $start, $o-1 ]; } elsif( $l == $best ) { push @solns, [ $substr, $start, $o-1 ]; } } } } return \@solns; }

        Which means that for your latest example it now produces similar output to lcss_all():

        [ 7:25:50.33] C:\test>lcss-test.pl [["gh", 1, 7], ["gh", 1, 9], ["da", 3, 4]]

        Though as you can see, because the algorithm operates in a different way to lcss_all(), it produces the matches in a different order. That could be cured by sorting by position, but then you've added another overhead for those applications that simply do not care.

        For example: I finally dug out the CD that contained the code I sent to my client, and I discovered why they have never encountered the false hit problem you found in the OP code. In the application, the needles (shorter strings) are always newline terminated; and the haystacks (longer strings) do not contain embedded newlines, so that particular problem can never occur in their application. So for their purposes; my OP algorithm was and is still "functionally correct".

        However, their primary concern was that they are/(were) processing huge volumes of data and the lcss() algorithm they were using was responsible for a high proportion of their processing time. The savings from my version of the algorithm combined with some other savings I found reduced their runtime by almost 50%. Slow and exhaustive would have been of no value to them at all.

        On the CD I also found that I had coded up an Inline::C version of the OP algorithm that runs even faster; but that would (almost certainly) exhibit the same 'flaw' as the OP version. Thinking back I recall they rejected that as they did not have the skills to maintain C code. None the less; with the addition of the fix I gave above for the false hits problem, that would probably satisfy the needs of many users with a lcss() requirement.

        So the problem is: where to go from here?


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
        In the absence of evidence, opinion is indistinguishable from prejudice.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1147726]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others drinking their drinks and smoking their pipes about the Monastery: (7)
As of 2024-04-19 08:28 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found