A better implementation of LCSS?

BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

The following pure-Perl implementation of Longest Common Sub String outstrips even the advanced algorithm used by String::LCSS_XS:

#! perl -slw
use strict;
use Time::HiRes qw[ time ];

sub lcssN (\$\$;$) {
    my( $ref1, $ref2, $min ) = @_;
    my( $swapped, $l1, $l2 ) = ( 0, map length( $$_ ), $ref1, $ref2 );
    ( $l2, $ref2, $l1, $ref1, $swapped ) = ( $l1, $ref1, $l2, $ref2, 1
+ ) if $l1 > $l2;
    $min = 1 unless defined $min;

    my $mask = $$ref1 x ( int( $l2 / $l1 ) + 1 );

    my @match = '';
    for my $start ( 0 .. $l1-1 ) {
        my $masked = substr( $mask, $start, $l2 ) ^ $$ref2;
        while( $masked =~ m[\0{$min,}]go ) {
            @match = (
                substr( $$ref2, $-[ 0 ], $+[ 0 ] - $-[ 0 ] ),
                ( $-[ 0 ]+$start ) % $l1,
                $-[ 0 ]
            ) if ( $+[ 0 ] - $-[ 0 ] ) > length $match[ 0 ];
        }
    }
    @match[ 2, 1 ] = @match[ 1, 2 ] if $swapped;
    return unless $match[ 0 ];
    return wantarray ? @match : $match[ 0 ];
}

our $MIN //= 10;
my $start = time;

my( @labels, @strings );
while( <> ) {
    push @labels, $_;
    push @strings, scalar <>;
}

chomp @labels; chomp @strings;

for my $i ( 0 .. $#strings ) {
    for my $j ( $i+1 .. $#strings ) {
        my( $m, $o1, $o2 ) = lcssN( $strings[ $i ], $strings[ $j ], $M
+IN );
        next unless defined $m;
        printf "%s(%d) and %s(%d): %d '%s'\n",
             $labels[ $i ], $o1,
             $labels[ $j ], $o2,
             length( $m ), $m;
    }
}

printf "Took: %.3f seconds\n", time() - $start;

__END__
## The script above
c:\test>perl -s lcssn.pl -MIN=10 -- junk90.dat 
000001(37) and 000002(872): 127 '5808821137152553645216516684787076304
+368738347768274782252043367265484547586755564151615422250715355234473
+558428710868782135070'
000008(550) and 000089(355): 10 '3252367176'
000040(219) and 000081(623): 11 '61341721171'
000046(808) and 000056(845): 12 '876526361506'
000058(837) and 000069(276): 11 '00666788082'
Took: 12.494 seconds

## A similar script that uses String::LCSS_XS on the same file
c:\test>lcss10 junk90.dat
000001(37) and 000002(872): 127 '5808821137152553645216516684787076304
+368738347768274782252043367265484547586755564151615422250715355234473
+558428710868782135070'
000008(550) and 000089(355): 10 '3252367176'
000040(219) and 000081(623): 11 '61341721171'
000046(808) and 000056(845): 12 '876526361506'
000058(837) and 000069(276): 11 '00666788082'
Took: 14.577 seconds
[download]

If I were to package this up for CPAN, the obvious namespace would be String::LCSS, especially as that module is fundamentally broken, hasn't been updated in 6 years and has outstanding bugs going back 4 years.

However, getting module maintainers to accept NIH code is fraught with frustrations; the procedure (what is that again?), for taking over maintenance of existing packages seems to be equally so.

So, what to do? Upload it as an unauthorised version? Under a different namespace? Suffer the frustrations?

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

"I'd rather go naked than blow up my ass"

Comment on A better implementation of LCSS? Download Code

Replies are listed 'Best First'.

Re: A better implementation of LCSS?
by Anonymous Monk on Jan 27, 2010 at 13:58 UTC

How do I adopt or take over a module already on CPAN?

[reply]

Re: A better implementation of LCSS?
by gmargo (Hermit) on Jan 27, 2010 at 16:04 UTC

I think it's inadvisable to use the Perl-5.10-specific operator //= in some code you want to share with the world.