A better implementation of LCSS?

PerlMonks

by BrowserUk (Patriarch)

on Jan 27, 2010 at 13:34 UTC

BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

The following pure-Perl implementation of Longest Common Sub String outstrips even the advanced algorithm used by String::LCSS_XS:

#! perl -slw
use strict;
use Time::HiRes qw[ time ];

sub lcssN (\$\$;$) {
    my( $ref1, $ref2, $min ) = @_;
    my( $swapped, $l1, $l2 ) = ( 0, map length( $$_ ), $ref1, $ref2 );
    ( $l2, $ref2, $l1, $ref1, $swapped ) = ( $l1, $ref1, $l2, $ref2, 1 ) if $l1 > $l2;
    $min = 1 unless defined $min;

    my $mask = $$ref1 x ( int( $l2 / $l1 ) + 1 );

    my @match = '';
    for my $start ( 0 .. $l1-1 ) {
        my $masked = substr( $mask, $start, $l2 ) ^ $$ref2;
        while( $masked =~ m[\0{$min,}]go ) {
            @match = (
                substr( $$ref2, $-[ 0 ], $+[ 0 ] - $-[ 0 ] ),
                ( $-[ 0 ]+$start ) % $l1,
                $-[ 0 ]
            ) if ( $+[ 0 ] - $-[ 0 ] ) > length $match[ 0 ];
        }
    }
    @match[ 2, 1 ] = @match[ 1, 2 ] if $swapped;
    return unless $match[ 0 ];
    return wantarray ? @match : $match[ 0 ];
}

our $MIN //= 10;
my $start = time;

my( @labels, @strings );
while( <> ) {
    push @labels, $_;
    push @strings, scalar <>;
}

chomp @labels; chomp @strings;

for my $i ( 0 .. $#strings ) {
    for my $j ( $i+1 .. $#strings ) {
        my( $m, $o1, $o2 ) = lcssN( $strings[ $i ], $strings[ $j ], $MIN );
        next unless defined $m;
        printf "%s(%d) and %s(%d): %d '%s'\n",
             $labels[ $i ], $o1,
             $labels[ $j ], $o2,
             length( $m ), $m;
    }
}

printf "Took: %.3f seconds\n", time() - $start;

__END__
## The script above
c:\test>perl -s lcssn.pl -MIN=10 -- junk90.dat 
000001(37) and 000002(872): 127 '5808821137152553645216516684787076304368738347768274782252043367265484547586755564151615422250715355234473558428710868782135070'
000008(550) and 000089(355): 10 '3252367176'
000040(219) and 000081(623): 11 '61341721171'
000046(808) and 000056(845): 12 '876526361506'
000058(837) and 000069(276): 11 '00666788082'
Took: 12.494 seconds

## A similar script that uses String::LCSS_XS on the same file
c:\test>lcss10 junk90.dat
000001(37) and 000002(872): 127 '5808821137152553645216516684787076304368738347768274782252043367265484547586755564151615422250715355234473558428710868782135070'
000008(550) and 000089(355): 10 '3252367176'
000040(219) and 000081(623): 11 '61341721171'
000046(808) and 000056(845): 12 '876526361506'
000058(837) and 000069(276): 11 '00666788082'
Took: 14.577 seconds

If I were to package this up for CPAN, the obvious namespace would be String::LCSS, especially as that module is fundamentally broken, hasn't been updated in 6 years and has outstanding bugs going back 4 years.

However, getting module maintainers to accept NIH code is fraught with frustrations; the procedure (what is that again?) for taking over maintenance of existing packages seems to be equally so.

So, what to do? Upload it as an unauthorised version? Under a different namespace? Suffer the frustrations?

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

"I'd rather go naked than blow up my ass"


Replies are listed 'Best First'.

Re: A better implementation of LCSS?
by Anonymous Monk on Jan 27, 2010 at 13:58 UTC

How do I adopt or take over a module already on CPAN?

Re: A better implementation of LCSS?
by gmargo (Hermit) on Jan 27, 2010 at 16:04 UTC

I think it's inadvisable to use the Perl-5.10-specific operator //= in some code you want to share with the world.

Re^2: A better implementation of LCSS?

by BrowserUk (Patriarch) on Jan 27, 2010 at 16:28 UTC

Good point! Easily changed.
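
For readers on a Perl older than 5.10, the change is a one-liner. A minimal sketch, assuming `$MIN` is the package variable populated by the `-s` switch as in the script above; it reuses the idiom the sub already applies to `$min`:

```perl
use strict;
use warnings;

our $MIN;    # set by `perl -s script.pl -MIN=10`; undef when the switch is absent

# Perl 5.10 and later:   $MIN //= 10;
# Equivalent on older perls:
$MIN = 10 unless defined $MIN;

print "MIN is $MIN\n";
```

Note that `||=` would not be an exact substitute, because it would also replace an explicit `-MIN=0`.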


Re^2: A better implementation of LCSS?

by Anonymous Monk on Jan 27, 2010 at 16:07 UTC

No, no it's not :)

Re: A better implementation of LCSS?
by JavaFan (Canon) on Jan 27, 2010 at 13:55 UTC

I'd upload it under a different namespace. It's a pity that namespaces are "handed" out on a "first asked - first gotten" basis. It means that crappy/unsupported modules can get the "good" names. However, it seems to work OK in practice, and I can only see huge drawbacks to any other system. (Any other system that leads to a "better" (for some values of better) allocation of names (or reallocation of names) requires more people to do work, which includes making decisions that make some people unhappy.)

Re: A better implementation of LCSS? (bug?)
by toolic (Bishop) on Nov 13, 2015 at 02:04 UTC

my $s1 = 'xxxyyxxy';
my $s2 = 'yyyxyxx';
my( $m, $o1, $o2 ) = lcssN($s1, $s2, 1);
print "$m, $o1, $o2\n";

__END__

Prints:

yxxy, 4, 4

But, I expect yyx (as String::LCSS_XS produces).
yxxy is not a substring of $s2.

Re^2: A better implementation of LCSS? (Yes)

by BrowserUk (Patriarch) on Nov 13, 2015 at 04:41 UTC

Yes. You found a bug.

A simpler example is 'abcdefg' & 'abcdefga'.

What happens is this. To speed up the processing, the code XORs the longer input with a string that contains the shorter string replicated until it is longer than the longer string.

I.e., if you have 'the quick brown fox' & 'brown', the shorter is replicated and XORed with the longer like so:

the quick brown fox
brownbrownbrownbrown
..........00000.....

Then the xored result is scanned looking for contiguous runs of zeros the length of the shorter string. In this case '00000'.
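
The masking step can be tried in isolation. The following is a minimal sketch, not code from the thread: `aligned_runs` is a hypothetical helper that XORs two equal-length strings and reports each run of at least `$min` NUL bytes, i.e. each aligned stretch where the two strings agree.

```perl
use strict;
use warnings;

# Hypothetical helper (illustration only): XOR two equal-length strings and
# return [ offset, text ] for every run of $min or more matching bytes.
sub aligned_runs {
    my( $s1, $s2, $min ) = @_;
    my $xored = $s1 ^ $s2;    # bytewise string XOR; equal bytes become "\0"
    my @runs;
    while( $xored =~ m[\0{$min,}]g ) {
        my( $from, $to ) = ( $-[ 0 ], $+[ 0 ] );
        push @runs, [ $from, substr( $s1, $from, $to - $from ) ];
    }
    return @runs;
}

my $long = 'the quick brown fox';
my $mask = substr( 'brown' x 4, 0, length $long );    # replicated, then trimmed

printf "offset %d: '%s'\n", @$_ for aligned_runs( $long, $mask, 3 );
# prints: offset 10: 'brown'
```

This only covers a single alignment; the sub in this thread repeats the XOR for every start offset into the replicated mask.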

In your case and my example above, the process of replicating the shorter string creates false matches:

xxxyyxxy
yyyxyxxyyyxyxx
....0000...... False match

abcdefga
abcdefgabcdefg
00000000...... False match
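
Both false matches can be reproduced with a few lines. This is a sketch under assumptions: `classify_runs` is a hypothetical name, it looks only at start offset 0, and it labels a run genuine only if `index` finds it in the unreplicated shorter string.

```perl
use strict;
use warnings;

# Hypothetical helper (illustration only): build the replicated mask as the
# thread's code does, XOR at offset 0, and classify each run of matches.
sub classify_runs {
    my( $short, $long ) = @_;
    my $mask  = $short x ( int( length( $long ) / length( $short ) ) + 1 );
    my $xored = substr( $mask, 0, length $long ) ^ $long;
    my @found;
    while( $xored =~ m[\0+]g ) {
        my( $from, $to ) = ( $-[ 0 ], $+[ 0 ] );
        my $candidate = substr( $long, $from, $to - $from );
        # A run that straddles the seam between two copies of $short
        # is not a substring of $short itself.
        my $verdict = index( $short, $candidate ) >= 0 ? 'genuine' : 'false match';
        push @found, [ $candidate, $verdict ];
    }
    return @found;
}

for my $pair ( [ 'yyyxyxx', 'xxxyyxxy' ], [ 'abcdefg', 'abcdefga' ] ) {
    print "'$_->[ 0 ]': $_->[ 1 ]\n" for classify_runs( @$pair );
}
```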

Which makes it amazing to me that the guys I originally wrote the code for have never come back to me. I'm not sure I even know how to contact them again.

The obvious solution is to throw away this 'optimisation' and use another nested loop; at which point the performance gain that was the code's raison d'être probably disappears :(
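
For comparison, a plain nested-loop version might look like the sketch below. It is not the code from this thread but the textbook dynamic-programming approach, and `lcss_dp` is a hypothetical name. It takes time proportional to the product of the two lengths and returns only the substring, not the offsets.

```perl
use strict;
use warnings;

# Hypothetical reference implementation (textbook dynamic programming).
# $curr[ $j ] holds the length of the common run ending at character $i
# of $s1 and character $j of $s2 (both counted from 1).
sub lcss_dp {
    my( $s1, $s2 ) = @_;
    my( $best_len, $best_end ) = ( 0, 0 );
    my @prev = ( 0 ) x ( length( $s2 ) + 1 );
    for my $i ( 1 .. length $s1 ) {
        my @curr = ( 0 ) x ( length( $s2 ) + 1 );
        my $c1   = substr( $s1, $i - 1, 1 );
        for my $j ( 1 .. length $s2 ) {
            next unless $c1 eq substr( $s2, $j - 1, 1 );
            # Extend the run that ended at the previous pair of characters.
            $curr[ $j ] = $prev[ $j - 1 ] + 1;
            ( $best_len, $best_end ) = ( $curr[ $j ], $i )
                if $curr[ $j ] > $best_len;
        }
        @prev = @curr;
    }
    return substr( $s1, $best_end - $best_len, $best_len );
}

print lcss_dp( 'xxxyyxxy', 'yyyxyxx' ), "\n";    # prints: yyx
```

When several common substrings share the maximum length, this sketch returns the one that ends earliest in the first argument.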

A first pass at not throwing away the performance gain is this:

sub lcssN (\$\$;$) {
    my( $ref1, $ref2, $min ) = @_;
    my( $swapped, $l1, $l2 ) = ( 0, map length( $$_ ), $ref1, $ref2 );
    ( $l2, $ref2, $l1, $ref1, $swapped ) = ( $l1, $ref1, $l2, $ref2, 1 ) if $l1 > $l2;
    $min = 1 unless defined $min;

    my $mask = $$ref1 x ( int( $l2 / $l1 ) + 1 );

    my @match = '';
    for my $start ( 0 .. $l1-1 ) {
        my $masked = substr( $mask, $start, $l2 ) ^ $$ref2;
        while( $masked =~ m[\0{$min,}]go ) {
            my $l = $+[ 0 ] - $-[ 0 ];
            my $match = substr( $$ref2, $-[ 0 ], $l );
            next unless 1+index $$ref1, $match;
            @match = (
                $match,
                ( $-[ 0 ]+$start ) % $l1,
                $-[ 0 ]
            ) if $l > length $match[ 0 ];
        }
    }
    @match[ 2, 1 ] = @match[ 1, 2 ] if $swapped;
    return unless $match[ 0 ];
    return wantarray ? @match : $match[ 0 ];
}

I haven't tested what effect that has on performance.

Update: This also works for the example you posted but I haven't convinced myself that it won't fail for other inputs yet:

sub lcssN (\$\$;$) {
    my( $ref1, $ref2, $min ) = @_;
    my( $swapped, $l1, $l2 ) = ( 0, map length( $$_ ), $ref1, $ref2 );
    ( $l2, $ref2, $l1, $ref1, $swapped ) = ( $l1, $ref1, $l2, $ref2, 1 ) if $l1 > $l2;
    $min = 1 unless defined $min;

    my $mask = $$ref1 x ( int( $l2 / $l1 ) );

    my @match = '';
    for my $start ( 0 .. $l1-1 ) {
        my $masked = substr( $mask, $start, $l2 ) ^ $$ref2;
        while( $masked =~ m[\0{$min,}]go ) {
            my $l = $+[ 0 ] - $-[ 0 ];
            my $match = substr( $$ref2, $-[ 0 ], $l );
            @match = (
                $match,
                ( $-[ 0 ]+$start ) % $l1,
                $-[ 0 ]
            ) if $l > length $match[ 0 ];
        }
    }
    @match[ 2, 1 ] = @match[ 1, 2 ] if $swapped;
    return unless $match[ 0 ];
    return wantarray ? @match : $match[ 0 ];
}

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)

In the absence of evidence, opinion is indistinguishable from prejudice.


Re^3: A better implementation of LCSS? (testing ... 1,2,3)

by toolic (Bishop) on Nov 14, 2015 at 03:32 UTC

The test data is a collection of all the CPAN test files for the 3 modules (Algorithm::LCSS, String::LCSS, String::LCSS_XS). I added the checks which caused problems for the BrowserUk code, plus a few extras so far.

The test script can be used to run these checks on the various LCSS subs, one at a time. I included the 3 variants of BrowserUk's code, Perl code I found on Wikipedia, plus this compact regex version: Longest common substring. The wiki, the regex and String::LCSS_XS pass all checks.

What I'd really like is a way to generate tests automatically. Generating input strings is straightforward, but generating expected values is tricky without a reference model. I tried to stfw for ready-made groups of input strings and expected values, but found nothing. I may just go ahead and use one of the 3 that have no known bugs yet as a reference model.
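
One way to get expected values without trusting any of the modules is a brute-force oracle that is slow but easy to verify by eye. The sketch below is an assumption about how that could be set up, not the lcss.pl script mentioned here; every sub name in it is hypothetical. It checks three properties of each answer: it occurs in both inputs, and its length equals the oracle's.

```perl
use strict;
use warnings;

# Hypothetical oracle: try every substring of $s1, longest first.
sub lcss_length_brute {
    my( $s1, $s2 ) = @_;
    for my $len ( reverse 1 .. length $s1 ) {
        for my $pos ( 0 .. length( $s1 ) - $len ) {
            return $len if index( $s2, substr( $s1, $pos, $len ) ) >= 0;
        }
    }
    return 0;
}

# Short strings over a two-letter alphabet, so repeats are common.
sub random_string {
    my( $max_len ) = @_;
    return join '', map { ( 'x', 'y' )[ rand 2 ] } 1 .. 1 + int rand $max_len;
}

# Run $trials random pairs through $candidate (a code ref taking two
# strings and returning a substring); return the failing cases.
sub find_failures {
    my( $candidate, $trials ) = @_;
    my @failures;
    for ( 1 .. $trials ) {
        my( $s1, $s2 ) = ( random_string( 8 ), random_string( 8 ) );
        my $got = $candidate->( $s1, $s2 );
        $got = '' unless defined $got;
        my $ok = index( $s1, $got ) >= 0
              && index( $s2, $got ) >= 0
              && length( $got ) == lcss_length_brute( $s1, $s2 );
        push @failures, [ $s1, $s2, $got ] unless $ok;
    }
    return @failures;
}

# Deliberately naive candidate: only ever returns a single character.
my $naive = sub {
    my( $s1, $s2 ) = @_;
    for my $c ( split //, $s1 ) { return $c if index( $s2, $c ) >= 0 }
    return '';
};

my @bad = find_failures( $naive, 1000 );
printf "%d of 1000 random pairs failed\n", scalar @bad;
```

Comparing lengths rather than strings sidesteps the question of which answer is "right" when several common substrings tie for longest.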

Here is the lcss.pl script:

Read more... (8 kB)

Here is the test data file test.txt:

Read more... (2 kB)


Re^4: A better implementation of LCSS? (Do you have any combinatorics expertise to bring to bear?)

by BrowserUk (Patriarch) on Nov 14, 2015 at 09:44 UTC

Re^5: A better implementation of LCSS? (Do you have any combinatorics expertise to bring to bear?)

by toolic (Bishop) on Nov 15, 2015 at 14:08 UTC

Some notes below your chosen depth have not been shown here

Re: A better implementation of LCSS?
by toolic (Bishop) on Nov 11, 2015 at 20:17 UTC

In a related question the other day, ikegami posted a solution using String::LCSS_XS, which he later deleted. This led me to investigate what other modules which find longest common substrings are available on CPAN. Here is what I found:

Note that there are other modules with similar names, but they relate to longest common subsequences.

String::LCSS_XS seems to be the best of the bunch. It has one reported bug, but the bug is simple to avoid, and there is even a potential patch.

The other 3 modules have reported functional bugs for which there are no specified workarounds or patches.

Algorithm::LCSS was last updated in 2003 (which was a magical year for LCSS modules, apparently). The author's last activity on CPAN (for other modules) was in 2008. Soon after that, 2 bugs were filed, but the author never responded to either one.

The POD for Tree::Suffix indicates that the author has ceased to maintain the module due to numerous bugs in an external dependency.

Re^2: A better implementation of LCSS?

by BrowserUk (Patriarch) on Nov 11, 2015 at 21:07 UTC

No. I never have.

As has often been the case, my immediate needs were satisfied and some other project came along that demanded my time (whether for financial or intellectual reasons) and I've never been back to look at it.

If the code in this thread still stands up, and anyone is interested in packaging it, they have both my blessing and support.


Re^3: A better implementation of LCSS?

by toolic (Bishop) on Nov 12, 2015 at 02:42 UTC

https://rt.cpan.org/Ticket/Display.html?id=32036

I also patched the test to prove that your code fixes the reported functional bugs. Perhaps this will lower the barrier for someone to upload a new version of this module to CPAN.

UPDATE: I sent the module author an email, offering my services as co-maintainer. Waiting for a response...

Re^2: A better implementation of LCSS?

by ikegami (Patriarch) on Nov 16, 2015 at 21:04 UTC

Algorithm::LCSS's documentation says it finds the longest subsequence (axaxaxa + ayayaya = aaaaa), ~~not the longest substring, but then it compares itself to String:LCSS?!?~~ but it does indeed find the longest substring like String::LCSS.

Re^3: A better implementation of LCSS?

by toolic (Bishop) on Nov 16, 2015 at 21:42 UTC

Algorithm::LCSS

Re^2: A better implementation of LCSS? (Memoize)

by toolic (Bishop) on Nov 18, 2015 at 21:02 UTC

For what it's worth, I used Memoize on the String::LCSS::lcss sub, and the increase in performance is huge. In fact, String::LCSS is faster than String::LCSS_XS.

The String::LCSS_XS POD shows these Benchmark results (which I was able to reproduce):

                   Rate    String::LCSS String::LCSS_XS
String::LCSS     60.9/s              --           -100%
String::LCSS_XS 84746/s         138966%              --

Here are the results with Memoize:

String::LCSS    version = 0.12
String::LCSS_XS version = 1.2
>>>the quick brown fox <<<
>>>the quick brown fox <<<

            Rate LCSS_XS    LCSS
LCSS_XS 195695/s      --    -27%
LCSS    268817/s     37%      --

Here is the code to run it:

Read more... (1156 Bytes)

Keep in mind that String::LCSS has critical bugs.


Re^3: A better implementation of LCSS? (Memoize)

by BrowserUk (Patriarch) on Nov 18, 2015 at 21:33 UTC

For what it's worth, I used Memoize on the String::LCSS::lcss sub, and the increase in performance is huge. In fact, String::LCSS is faster than String::LCSS_XS.

Sorry, but that is a useless test. You are always testing the same two strings, which means that you are simply getting back the same result each time after the first, without having to re-perform the algorithm.
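
The effect is easy to see with a counter. This is a minimal sketch; `slow_lcss` is a hypothetical stand-in whose body only counts how often it really runs, and `memoize` is the function exported by the core Memoize module.

```perl
use strict;
use warnings;
use Memoize;

my $calls = 0;

# Hypothetical stand-in for an expensive LCSS routine.
sub slow_lcss {
    my( $s1, $s2 ) = @_;
    $calls++;
    return "$s1|$s2";
}
memoize( 'slow_lcss' );

# Same pair every time, as in the benchmark: only the first call does work.
slow_lcss( 'the quick brown fox ', 'the quick brown fox ' ) for 1 .. 1000;
print "same pair x 1000:    body ran $calls time(s)\n";    # 1

# Distinct inputs: the cache never hits, so the body runs every time.
$calls = 0;
slow_lcss( "string $_", 'the quick brown fox ' ) for 1 .. 1000;
print "1000 distinct pairs: body ran $calls time(s)\n";    # 1000
```

A fairer benchmark would feed each implementation a stream of different string pairs, so that the algorithm itself is what gets timed.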


Re^4: A better implementation of LCSS? (Memoize)

by toolic (Bishop) on Nov 18, 2015 at 21:36 UTC

Re^5: A better implementation of LCSS? (Memoize)

by BrowserUk (Patriarch) on Nov 18, 2015 at 21:52 UTC

Re: A better implementation of LCSS?
by foolishmortal (Novice) on Jan 28, 2010 at 02:56 UTC

String::LCSS::PP ?

Re^2: A better implementation of LCSS?

by BrowserUk (Patriarch) on Jan 28, 2010 at 02:59 UTC

I thought about that, but in the normal way of things *::PP modules are fallbacks for when the *::XS version won't compile. A reliable but slow option. In this case, it is actually faster.


Re^3: A better implementation of LCSS?

by ikegami (Patriarch) on Jan 28, 2010 at 03:48 UTC

I agree. For this module, it's not relevant that it's written in Perl.

Longest common subsequence is also abbreviated LCS, and String::LCS is not currently used.

Update: And of course, that's not the problem you are solving. It does go to show that LCSS is a bad choice anyway.

Algorithm::LCSS::LCSS    Longest common subsequence
String::LCSS::lcss       Longest common substring
Algorithm::Diff::LCS     Longest common subsequence

String::LCSubstr?

Re^4: A better implementation of LCSS?

by BrowserUk (Patriarch) on Jan 28, 2010 at 06:52 UTC

Re^3: A better implementation of LCSS?

by lima1 (Curate) on Jan 28, 2010 at 16:31 UTC

Just out of curiosity, is it also faster when you increase the string lengths? Or when you set MIN to 1 or 2? The XS overhead could also be the problem. At least in theory it is hard to beat the LCSS_XS algorithm. But nevertheless, well done :-)

Re^4: A better implementation of LCSS?

by BrowserUk (Patriarch) on Jan 28, 2010 at 23:50 UTC

Re^5: A better implementation of LCSS?

by lima1 (Curate) on Jan 29, 2010 at 10:14 UTC
