comment on

The following pure-Perl implementation of Longest Common Sub String outstrips even the advanced algorithm used by String::LCSS_XS:

#! perl -slw
use strict;
use Time::HiRes qw[ time ];

sub lcssN (\$\$;$) {
    my( $ref1, $ref2, $min ) = @_;
    my( $swapped, $l1, $l2 ) = ( 0, map length( $$_ ), $ref1, $ref2 );
    ( $l2, $ref2, $l1, $ref1, $swapped ) = ( $l1, $ref1, $l2, $ref2, 1
+ ) if $l1 > $l2;
    $min = 1 unless defined $min;

    my $mask = $$ref1 x ( int( $l2 / $l1 ) + 1 );

    my @match = '';
    for my $start ( 0 .. $l1-1 ) {
        my $masked = substr( $mask, $start, $l2 ) ^ $$ref2;
        while( $masked =~ m[\0{$min,}]go ) {
            @match = (
                substr( $$ref2, $-[ 0 ], $+[ 0 ] - $-[ 0 ] ),
                ( $-[ 0 ]+$start ) % $l1,
                $-[ 0 ]
            ) if ( $+[ 0 ] - $-[ 0 ] ) > length $match[ 0 ];
        }
    }
    @match[ 2, 1 ] = @match[ 1, 2 ] if $swapped;
    return unless $match[ 0 ];
    return wantarray ? @match : $match[ 0 ];
}

our $MIN //= 10;
my $start = time;

my( @labels, @strings );
while( <> ) {
    push @labels, $_;
    push @strings, scalar <>;
}

chomp @labels; chomp @strings;

for my $i ( 0 .. $#strings ) {
    for my $j ( $i+1 .. $#strings ) {
        my( $m, $o1, $o2 ) = lcssN( $strings[ $i ], $strings[ $j ], $M
+IN );
        next unless defined $m;
        printf "%s(%d) and %s(%d): %d '%s'\n",
             $labels[ $i ], $o1,
             $labels[ $j ], $o2,
             length( $m ), $m;
    }
}

printf "Took: %.3f seconds\n", time() - $start;

__END__
## The script above
c:\test>perl -s lcssn.pl -MIN=10 -- junk90.dat 
000001(37) and 000002(872): 127 '5808821137152553645216516684787076304
+368738347768274782252043367265484547586755564151615422250715355234473
+558428710868782135070'
000008(550) and 000089(355): 10 '3252367176'
000040(219) and 000081(623): 11 '61341721171'
000046(808) and 000056(845): 12 '876526361506'
000058(837) and 000069(276): 11 '00666788082'
Took: 12.494 seconds

## A similar script that uses String::LCSS_XS on the same file
c:\test>lcss10 junk90.dat
000001(37) and 000002(872): 127 '5808821137152553645216516684787076304
+368738347768274782252043367265484547586755564151615422250715355234473
+558428710868782135070'
000008(550) and 000089(355): 10 '3252367176'
000040(219) and 000081(623): 11 '61341721171'
000046(808) and 000056(845): 12 '876526361506'
000058(837) and 000069(276): 11 '00666788082'
Took: 14.577 seconds
[download]

If I were to package this up for CPAN, the obvious namespace would be String::LCSS, especially as that module is fundamentally broken, hasn't been updated in 6 years and has outstanding bugs going back 4 years.

However, getting module maintainers to accept NIH code is fraught with frustrations; the procedure (what is that again?), for taking over maintenance of existing packages seems to be equally so.

So, what to do? Upload it as an unauthorised version? Under a different namespace? Suffer the frustrations?

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

"I'd rather go naked than blow up my ass"

In reply to A better implementation of LCSS? by BrowserUk

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


P is for Practical
	PerlMonks