http://qs321.pair.com?node_id=819919

BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

The following pure-Perl implementation of Longest Common Sub String outstrips even the advanced algorithm used by String::LCSS_XS:

#! perl -slw use strict; use Time::HiRes qw[ time ]; sub lcssN (\$\$;$) { my( $ref1, $ref2, $min ) = @_; my( $swapped, $l1, $l2 ) = ( 0, map length( $$_ ), $ref1, $ref2 ); ( $l2, $ref2, $l1, $ref1, $swapped ) = ( $l1, $ref1, $l2, $ref2, 1 + ) if $l1 > $l2; $min = 1 unless defined $min; my $mask = $$ref1 x ( int( $l2 / $l1 ) + 1 ); my @match = ''; for my $start ( 0 .. $l1-1 ) { my $masked = substr( $mask, $start, $l2 ) ^ $$ref2; while( $masked =~ m[\0{$min,}]go ) { @match = ( substr( $$ref2, $-[ 0 ], $+[ 0 ] - $-[ 0 ] ), ( $-[ 0 ]+$start ) % $l1, $-[ 0 ] ) if ( $+[ 0 ] - $-[ 0 ] ) > length $match[ 0 ]; } } @match[ 2, 1 ] = @match[ 1, 2 ] if $swapped; return unless $match[ 0 ]; return wantarray ? @match : $match[ 0 ]; } our $MIN //= 10; my $start = time; my( @labels, @strings ); while( <> ) { push @labels, $_; push @strings, scalar <>; } chomp @labels; chomp @strings; for my $i ( 0 .. $#strings ) { for my $j ( $i+1 .. $#strings ) { my( $m, $o1, $o2 ) = lcssN( $strings[ $i ], $strings[ $j ], $M +IN ); next unless defined $m; printf "%s(%d) and %s(%d): %d '%s'\n", $labels[ $i ], $o1, $labels[ $j ], $o2, length( $m ), $m; } } printf "Took: %.3f seconds\n", time() - $start; __END__ ## The script above c:\test>perl -s lcssn.pl -MIN=10 -- junk90.dat 000001(37) and 000002(872): 127 '5808821137152553645216516684787076304 +368738347768274782252043367265484547586755564151615422250715355234473 +558428710868782135070' 000008(550) and 000089(355): 10 '3252367176' 000040(219) and 000081(623): 11 '61341721171' 000046(808) and 000056(845): 12 '876526361506' 000058(837) and 000069(276): 11 '00666788082' Took: 12.494 seconds ## A similar script that uses String::LCSS_XS on the same file c:\test>lcss10 junk90.dat 000001(37) and 000002(872): 127 '5808821137152553645216516684787076304 +368738347768274782252043367265484547586755564151615422250715355234473 +558428710868782135070' 000008(550) and 000089(355): 10 '3252367176' 000040(219) and 000081(623): 11 '61341721171' 000046(808) and 000056(845): 12 '876526361506' 000058(837) and 000069(276): 11 '00666788082' Took: 14.577 seconds

If I were to package this up for CPAN, the obvious namespace would be String::LCSS, especially as that module is fundamentally broken, hasn't been updated in 6 years and has outstanding bugs going back 4 years.

However, getting module maintainers to accept NIH code is fraught with frustrations; the procedure (what is that again?), for taking over maintenance of existing packages seems to be equally so.

So, what to do? Upload it as an unauthorised version? Under a different namespace? Suffer the frustrations?


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re: A better implementation of LCSS?
by Anonymous Monk on Jan 27, 2010 at 13:58 UTC
Re: A better implementation of LCSS?
by gmargo (Hermit) on Jan 27, 2010 at 16:04 UTC

    I think it's inadvisable to use the Perl-5.10-specific operator //= in some code you want to share with the world.

      Good point! Easily changed.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
      No, no its not :)
Re: A better implementation of LCSS?
by JavaFan (Canon) on Jan 27, 2010 at 13:55 UTC
    I'd upload it under a different namespace. It's a pity that namespaces are "handed" out on a "first asked - first gotten" bases. It means that crappy/unsupported modules can get the "good" names. However, it seems to work ok in practise, and I can only see huge drawbacks against any other system. (Any other system that leads to a "better" (for some values of better) allocation of names (or reallocation of names) requires more people to do work - which includes making decisions that makes some people unhappy).
Re: A better implementation of LCSS? (bug?)
by toolic (Bishop) on Nov 13, 2015 at 02:04 UTC
    I started throwing some miscellaneous strings at the code, but I get an unexpected result in one case. Am I using the lcssN sub properly?
    my $s1 = 'xxxyyxxy'; my $s2 = 'yyyxyxx'; my( $m, $o1, $o2 ) = lcssN($s1, $s2, 1); print "$m, $o1, $o2\n"; __END__ Prints: yxxy, 4, 4 But, I expect yyx (as String::LCSS_XS produces). yxxy is not a substring of $s2.

      Yes. You found a bug.

      A simpler example is 'abcdefg' & 'abcdefga'.

      What happens is this. To speed up the processing, the code xors the longer input with a string that contain the shorter string replicated until is is longer than the longer string.

      Ie. if you have 'the quick brown fox' & 'brown', the shorter is replicated and xored with the longer like so:

      the quick brown fox brownbrownbrownbrown ..........00000.....

      Then the xored result is scanned looking for contiguous runs of zeros the length of the shorter string. In this case '00000'.

      In your case and my example above, the process of replicating the shorter string creates false matches:

      xxxyyxxy yyyxyxxyyyxyxx ....0000...... False match abcdefga abcdefgabcdefg 00000000...... False match

      Which makes it amazing to me that the guys I originally wrote the code for have never come back to me. I'm not sure I even know how to contact them again.

      The obvious solution is to throw away this 'optimisation' and use another nested loop; at which point the performance gain that was the code's raison detre probably disappears :(

      A first pass at not throwing away the performance gain is this:

      sub lcssN (\$\$;$) { my( $ref1, $ref2, $min ) = @_; my( $swapped, $l1, $l2 ) = ( 0, map length( $$_ ), $ref1, $ref2 ); ( $l2, $ref2, $l1, $ref1, $swapped ) = ( $l1, $ref1, $l2, $ref2, 1 + ) if $l1 > $l2; $min = 1 unless defined $min; my $mask = $$ref1 x ( int( $l2 / $l1 ) + 1 ); my @match = ''; for my $start ( 0 .. $l1-1 ) { my $masked = substr( $mask, $start, $l2 ) ^ $$ref2; while( $masked =~ m[\0{$min,}]go ) { my $l = $+[ 0 ] - $-[ 0 ]; my $match = substr( $$ref2, $-[ 0 ], $l ); next unless 1+index $$ref1, $match; @match = ( $match, ( $-[ 0 ]+$start ) % $l1, $-[ 0 ] ) if $l > length $match[ 0 ]; } } @match[ 2, 1 ] = @match[ 1, 2 ] if $swapped; return unless $match[ 0 ]; return wantarray ? @match : $match[ 0 ]; }

      I haven't tested what affect that has on performance.

      Update:This also works for the example you posted but I haven't convinced myself that it won't fail for other inputs yet:

      sub lcssN (\$\$;$) { my( $ref1, $ref2, $min ) = @_; my( $swapped, $l1, $l2 ) = ( 0, map length( $$_ ), $ref1, $ref2 ); ( $l2, $ref2, $l1, $ref1, $swapped ) = ( $l1, $ref1, $l2, $ref2, 1 + ) if $l1 > $l2; $min = 1 unless defined $min; my $mask = $$ref1 x ( int( $l2 / $l1 ) ); my @match = ''; for my $start ( 0 .. $l1-1 ) { my $masked = substr( $mask, $start, $l2 ) ^ $$ref2; while( $masked =~ m[\0{$min,}]go ) { my $l = $+[ 0 ] - $-[ 0 ]; my $match = substr( $$ref2, $-[ 0 ], $l ); @match = ( $match, ( $-[ 0 ]+$start ) % $l1, $-[ 0 ] ) if $l > length $match[ 0 ]; } } @match[ 2, 1 ] = @match[ 1, 2 ] if $swapped; return unless $match[ 0 ]; return wantarray ? @match : $match[ 0 ]; }

      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
      In the absence of evidence, opinion is indistinguishable from prejudice.
        What we need are tests... lots of tests. I created a test script (lcss.pl) and a test input file (test.txt) which can be used to check the functionality of the various CPAN modules and other code I've stumbled upon. The focus of the testing is correct, basic functionality (not performance).

        The test data is a collection of all the CPAN test files for the 3 modules (Algorithm::LCSS, String::LCSS, String::LCSS_XS). I added the checks which caused problems for the BrowserUk code, plus a few extras so far.

        The test script can be used to run these checks on the various LCSS subs, one at a time. I included the 3 variants of BrowserUk's code, Perl code I found on wikipedia, plus this compact regex version: Longest common substring. The wiki, the regex and String::LCSS_XS pass all checks.

        What I'd really like is a way to generate tests automatically. Generating input strings is straightforward, but generating expect values is tricky without a reference model. I tried to stfw for ready-made groups of input strings and expect values, but found nothing. I may just go ahead and use one of the 3 that have no known bugs yet as a reference model.

        Here is the lcss.pl script:

        Here is the test data file test.txt:

Re: A better implementation of LCSS?
by toolic (Bishop) on Nov 11, 2015 at 20:17 UTC
    BrowserUk, did you ever upload your code to CPAN?

    In a related question the other day, ikegami posted a solution using String::LCSS_XS, which he later deleted. This led me to investigate what other modules which find longest common substrings are available on CPAN. Here is what I found:

    1. String::LCSS_XS
    2. Algorithm::LCSS
    3. String::LCSS
    4. Tree::Suffix

    Note that there are other modules with similar names, but they relate to longest common subsequences.

    String::LCSS_XS seems to be the best of the bunch. It has one reported bug, but the bug is simple to avoid, and there is even a potential patch.

    The other 3 modules have reported functional bugs for which there are no specified workarounds or patches.

    Algorithm::LCSS was last updated in 2003 (which was a magical year for LCSS modules, apparently). The author's last activity on CPAN (for other modules) was in 2008. Soon after that, 2 bugs were filed, but the author never responded to either one.

    The POD for Tree::Suffix indicates that the author has ceased to maintain the module due to numerous bugs in an external dependency.

      No. I never have.

      As has often been the case, my immediate needs were satisfied and some other project came along that demanded my time (whether for financial or intellectual reasons) and I've never been back to look at it.

      If the code in this thread still stands up, and anyone is interested in packaging it, they have both my blessing and support.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
      In the absence of evidence, opinion is indistinguishable from prejudice.
        I adapted your code to match the user interface of String::LCSS, and I uploaded it as a patch: https://rt.cpan.org/Ticket/Display.html?id=32036

        I also patched the test to prove that your code fixes the reported functional bugs. Perhaps this will lower the barrier for someone to upload a new version of this module to CPAN.

        UPDATE: I sent the module author an email, offering my services as co-maintainer. Waiting for a response...

      Algorithm::LCSS's documentation says it finds the longest subsequence (axaxaxa + ayayaya = aaaaa), not the longest substring, but then it compares itself to String:LCSS?!? but it does indeed find the longest substring like String::LCSS.

        Good point. When I first read through the POD, I mistakenly thought the terms "substring" and "subsequence" were synonymous. After playing around with the *::LCS modules, I realized there is a huge distinction between the terms. Perhaps the module author was under the same misconception. I agree with you that the Algorithm::LCSS POD should be clarified. UPDATE: I see you've logged a bug report.
      UPDATE: ahh, nevermind. BrowserUk just debunked this...


      Somewhat related...

      For what it's worth, I used Memoize on the String::LCSS::lcss sub, and the increase in performance is huge. In fact, String::LCSS is faster than String::LCSS_XS.

      The String::LCSS_XS POD shows these Benchmark results (which I was able to reproduce):

      Rate String::LCSS String::LCSS_XS String::LCSS 60.9/s -- -100% String::LCSS_XS 84746/s 138966% --

      Here are the results with Memoize:

      String::LCSS version = 0.12 String::LCSS_XS version = 1.2 >>>the quick brown fox <<< >>>the quick brown fox <<< Rate LCSS_XS LCSS LCSS_XS 195695/s -- -27% LCSS 268817/s 37% --

      Here is the code to run it:

      Keep in mind that String::LCSS has critical bugs.

        For what it's worth, I used Memoize on the String::LCSS::lcss sub, and the increase in performance is huge. In fact, String::LCSS is faster than String::LCSS_XS.

        Sorry, but that is a useless test. You are always testing the same two strings, which means that you are simply getting back the same result each time after the first, without having to re-perform the algorithm.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: A better implementation of LCSS?
by foolishmortal (Novice) on Jan 28, 2010 at 02:56 UTC
    String::LCSS::PP ?

      I thought about that, but in the normal way of things *::PP modules are fallbacks for when the *::XS version won't compile. A reliable but slow option. In this case, it is actually faster.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        I agree. For this module, it not relevant that it's written in Perl.

        Longest common subsequence is also abbreviated LCS, and String::LCS is not currently used.

        Update: And of course, that's not the problem you are solving. It does go to show that LCSS is a bad choice anyway.

        Algorithm::LCSS::LCSSLonguest common subsequence
        String::LCSS::lcssLonguest common substring
        Algorithm::Diff::LCSLonguest common subsequence

        String::LCSubstr?

        Just out of curiosity, is it also faster when you increase the string lengths? Or when you set MIN to 1 or 2? The XS overhead could also be the problem. At least in theory it is hard to beat the LCSS_XS algorithm. But nevertheless, well done :-)