Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Re: A better implementation of LCSS? (bug?)

by toolic (Bishop)
on Nov 13, 2015 at 02:04 UTC ( [id://1147629]=note: print w/replies, xml ) Need Help??


in reply to A better implementation of LCSS?

I started throwing some miscellaneous strings at the code, but I get an unexpected result in one case. Am I using the lcssN sub properly?
my $s1 = 'xxxyyxxy'; my $s2 = 'yyyxyxx'; my( $m, $o1, $o2 ) = lcssN($s1, $s2, 1); print "$m, $o1, $o2\n"; __END__ Prints: yxxy, 4, 4 But, I expect yyx (as String::LCSS_XS produces). yxxy is not a substring of $s2.

Replies are listed 'Best First'.
Re^2: A better implementation of LCSS? (Yes)
by BrowserUk (Patriarch) on Nov 13, 2015 at 04:41 UTC

    Yes. You found a bug.

    A simpler example is 'abcdefg' & 'abcdefga'.

    What happens is this. To speed up the processing, the code xors the longer input with a string that contain the shorter string replicated until is is longer than the longer string.

    Ie. if you have 'the quick brown fox' & 'brown', the shorter is replicated and xored with the longer like so:

    the quick brown fox brownbrownbrownbrown ..........00000.....

    Then the xored result is scanned looking for contiguous runs of zeros the length of the shorter string. In this case '00000'.

    In your case and my example above, the process of replicating the shorter string creates false matches:

    xxxyyxxy yyyxyxxyyyxyxx ....0000...... False match abcdefga abcdefgabcdefg 00000000...... False match

    Which makes it amazing to me that the guys I originally wrote the code for have never come back to me. I'm not sure I even know how to contact them again.

    The obvious solution is to throw away this 'optimisation' and use another nested loop; at which point the performance gain that was the code's raison detre probably disappears :(

    A first pass at not throwing away the performance gain is this:

    sub lcssN (\$\$;$) { my( $ref1, $ref2, $min ) = @_; my( $swapped, $l1, $l2 ) = ( 0, map length( $$_ ), $ref1, $ref2 ); ( $l2, $ref2, $l1, $ref1, $swapped ) = ( $l1, $ref1, $l2, $ref2, 1 + ) if $l1 > $l2; $min = 1 unless defined $min; my $mask = $$ref1 x ( int( $l2 / $l1 ) + 1 ); my @match = ''; for my $start ( 0 .. $l1-1 ) { my $masked = substr( $mask, $start, $l2 ) ^ $$ref2; while( $masked =~ m[\0{$min,}]go ) { my $l = $+[ 0 ] - $-[ 0 ]; my $match = substr( $$ref2, $-[ 0 ], $l ); next unless 1+index $$ref1, $match; @match = ( $match, ( $-[ 0 ]+$start ) % $l1, $-[ 0 ] ) if $l > length $match[ 0 ]; } } @match[ 2, 1 ] = @match[ 1, 2 ] if $swapped; return unless $match[ 0 ]; return wantarray ? @match : $match[ 0 ]; }

    I haven't tested what affect that has on performance.

    Update:This also works for the example you posted but I haven't convinced myself that it won't fail for other inputs yet:

    sub lcssN (\$\$;$) { my( $ref1, $ref2, $min ) = @_; my( $swapped, $l1, $l2 ) = ( 0, map length( $$_ ), $ref1, $ref2 ); ( $l2, $ref2, $l1, $ref1, $swapped ) = ( $l1, $ref1, $l2, $ref2, 1 + ) if $l1 > $l2; $min = 1 unless defined $min; my $mask = $$ref1 x ( int( $l2 / $l1 ) ); my @match = ''; for my $start ( 0 .. $l1-1 ) { my $masked = substr( $mask, $start, $l2 ) ^ $$ref2; while( $masked =~ m[\0{$min,}]go ) { my $l = $+[ 0 ] - $-[ 0 ]; my $match = substr( $$ref2, $-[ 0 ], $l ); @match = ( $match, ( $-[ 0 ]+$start ) % $l1, $-[ 0 ] ) if $l > length $match[ 0 ]; } } @match[ 2, 1 ] = @match[ 1, 2 ] if $swapped; return unless $match[ 0 ]; return wantarray ? @match : $match[ 0 ]; }

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
      What we need are tests... lots of tests. I created a test script (lcss.pl) and a test input file (test.txt) which can be used to check the functionality of the various CPAN modules and other code I've stumbled upon. The focus of the testing is correct, basic functionality (not performance).

      The test data is a collection of all the CPAN test files for the 3 modules (Algorithm::LCSS, String::LCSS, String::LCSS_XS). I added the checks which caused problems for the BrowserUk code, plus a few extras so far.

      The test script can be used to run these checks on the various LCSS subs, one at a time. I included the 3 variants of BrowserUk's code, Perl code I found on wikipedia, plus this compact regex version: Longest common substring. The wiki, the regex and String::LCSS_XS pass all checks.

      What I'd really like is a way to generate tests automatically. Generating input strings is straightforward, but generating expect values is tricky without a reference model. I tried to stfw for ready-made groups of input strings and expect values, but found nothing. I may just go ahead and use one of the 3 that have no known bugs yet as a reference model.

      Here is the lcss.pl script:

      Here is the test data file test.txt:

        What I'd really like is a way to generate tests automatically

        I agree 100% that a testcase generator is the only sure way to test optimisations for this type of algorithm. However, I think I would approach that goal from a somewhat different angle.

        Firstly, I believe that it is relatively easy to code (if not so easy to prove) an exhaustive (thus slow) algorithm that is guaranteed to inspect all possibilities; and by extension find all 'best' solutions.

        Here is (what I believe to be) such an implementation:

        sub lcss_brute { my( $r1, $r2, $min ) = @_; my( $l1, $l2, $swap ) = ( length $$r1, length $$r2, 0 ); ( $r1, $r2, $l1, $l2, $swap ) = ( $r2, $r1, $l2, $l1, 1 ) if $l1 > + $l2; my( $best, @solns ) = 0; for my $start ( 0 .. $l2 - 1 ) { for my $l ( reverse 1 .. $l1 - $start ) { my $substr = substr( $$r1, $start, $l ); my $o = index( $$r2, $substr ); next if $o < 0; if( $l > $best ) { $best = length $substr; @solns = [ $substr, $start, $o ]; } elsif( $l == $best ) { push @solns, [ $substr, $start, $o ]; } } } return \@solns; }

        Supplied with the bug you found, it produces both valid solutions::

        pp lcss_brute( \'xxxyyxxy', \'yyyxyxx', 1 ); C:\test>lcss-test.pl [["yyx", 1, 3], ["yxx", 4, 4]]

        That, if others agree it can neither miss solutions nor generate false ones, takes care of the "reference model".

        Then comes the problem of ensuring you generate a set of tests that is guaranteed (or at least very likely) to explore all the possible failure modes for optimised implementations.

        Initially, that seems like a 'how long is a piece of string' (substring:) problem; but thinking back to the little reading I've done on DFA/NFA theory, in particular the observation that for algorithmic proof purposes, single letters are often substituted ("without loss of generality") for unique substrings -- eg. 'aba' is equivalent to both 'the quick the' & 'onetwoone' etc. -- then I believe that it is possible to exercise all possible failure modes by iterating all the permutations of a relatively limited set of characters to produce the 'first string' inputs; and then iterating all the variations of all the substrings of that first string to produce the 'second string' inputs.

        Please note that my use of the terms 'permutations' and 'variations' in the above is in their common understanding meanings rather than their specific meanings in the combinatorics sense. That is to say, I have yet worked out whether the above should use permutations() or combinations() or variations() or *_with_repetitions() for the two parts of the generation.

        To this end, I've come up with this as a first pass at an exhaustive testcase generator:

        #! perl -slw use strict; use Data::Dump qw[ pp ]; $Data::Dump::WIDTH = 1000; use Algorithm::Combinatorics qw[ variations_with_repetition permutatio +ns ]; sub lcss_brute { my( $r1, $r2, $min ) = @_; my( $l1, $l2, $swap ) = ( length $$r1, length $$r2, 0 ); ( $r1, $r2, $l1, $l2, $swap ) = ( $r2, $r1, $l2, $l1, 1 ) if $l1 > + $l2; my( $best, @solns ) = 0; for my $start ( 0 .. $l2 - 1 ) { for my $l ( reverse 1 .. $l1 - $start ) { my $substr = substr( $$r1, $start, $l ); my $o = index( $$r2, $substr ); next if $o < 0; if( $l > $best ) { $best = length $substr; @solns = [ $substr, $start, $o ]; } elsif( $l == $best ) { push @solns, [ $substr, $start, $o ]; } } } return \@solns; } my @chars = 'a' .. 'e'; my $iter1 = permutations( \@chars ); while( $_ = $iter1->next ) { my $long = join '', @$_; for my $l ( 2 .. $#chars ) { my $iter = variations_with_repetition( $_, $l ); while( my $r = $iter->next ) { my $short = join '', @$r; my $solns = lcss_brute( \$long, \$short, 1 ); next unless defined( $solns ); printf "\rFrom '%s' in '%s'; solns:'%s'\t\t\t", $short, $l +ong, pp $solns; } } }

        Running that against all the permutation of 'a'..'e' as the first input; and all the variations with repetition of all lengths of substring of the generated first strings, runs in around a minute. I'm not sure yet, but that may be enough?

        I'm not convinced whether I should be using permutations or combinations; and maybe it could be just variations rather than variations with repetitions; and I'm not yet convinced whether 5 characters is enough; but may be others have some inputs at this point?

        An interesting exercise would be to run this against the known failing implementations and see if it detects problems; and if those problems it detects can be logically reduced to be the same as the known problems.

        Anyway, that's as far as my logic will allow me to go at this point; let me know what you (or anyone else still watching) think.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
        In the absence of evidence, opinion is indistinguishable from prejudice.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1147629]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having a coffee break in the Monastery: (4)
As of 2024-04-20 00:45 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found