
Re^2: A better implementation of LCSS?

by BrowserUk (Pope)
on Jan 28, 2010 at 02:59 UTC


in reply to Re: A better implementation of LCSS?
in thread A better implementation of LCSS?

I thought about that, but in the normal way of things *::PP modules are fallbacks for when the *::XS version won't compile. A reliable but slow option. In this case, it is actually faster.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^3: A better implementation of LCSS?
by ikegami (Pope) on Jan 28, 2010 at 03:48 UTC

    I agree. For this module, it is not relevant that it's written in Perl.

    Longest common subsequence is also abbreviated LCS, and String::LCS is not currently used.

    Update: And of course, that's not the problem you are solving. It does go to show that LCSS is a bad choice anyway.

    Algorithm::LCSS::LCSS    Longest common subsequence
    String::LCSS::lcss       Longest common substring
    Algorithm::Diff::LCS     Longest common subsequence
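To make the naming clash concrete, here is a minimal sketch of the two different problems that share the "LCS" abbreviation. The function names `longest_common_substring` and `lcs_length` are made up for illustration; this is a textbook dynamic-programming version, not the code of any of the modules listed above.

```perl
use strict;
use warnings;

# Longest common SUBSTRING: the match must be contiguous in both strings.
sub longest_common_substring {
    my ($s, $t) = @_;
    my ($best_len, $best_end) = (0, 0);
    my @prev = (0) x (length($t) + 1);
    for my $i (1 .. length($s)) {
        my @curr = (0) x (length($t) + 1);
        for my $j (1 .. length($t)) {
            next unless substr($s, $i - 1, 1) eq substr($t, $j - 1, 1);
            # Extend the run that ended at the previous character of each string.
            $curr[$j] = $prev[$j - 1] + 1;
            ($best_len, $best_end) = ($curr[$j], $i) if $curr[$j] > $best_len;
        }
        @prev = @curr;
    }
    return substr($s, $best_end - $best_len, $best_len);
}

# Longest common SUBSEQUENCE (length only): characters must appear in the
# same order in both strings, but gaps are allowed.
sub lcs_length {
    my ($s, $t) = @_;
    my @prev = (0) x (length($t) + 1);
    for my $i (1 .. length($s)) {
        my @curr = (0) x (length($t) + 1);
        for my $j (1 .. length($t)) {
            if (substr($s, $i - 1, 1) eq substr($t, $j - 1, 1)) {
                $curr[$j] = $prev[$j - 1] + 1;
            }
            else {
                $curr[$j] = $prev[$j] > $curr[$j - 1] ? $prev[$j] : $curr[$j - 1];
            }
        }
        @prev = @curr;
    }
    return $prev[-1];
}

print longest_common_substring('abcXdef', 'abcYdef'), "\n";   # abc
print lcs_length('abcXdef', 'abcYdef'), "\n";                 # 6
```

The same pair of inputs gives a 3-character common substring but a 6-character common subsequence, which is why one abbreviation for both is confusing.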

    String::LCSubstr?

      String::LCSubstr?

      I'm coming to the conclusion that that is the least-worst option. Now all I've got to do is remember my PAUSE credentials and how to package it up.

Re^3: A better implementation of LCSS?
by lima1 (Curate) on Jan 28, 2010 at 16:31 UTC
    Just out of curiosity, is it also faster when you increase the string lengths? Or when you set MIN to 1 or 2? The XS overhead could also be the problem. At least in theory it is hard to beat the LCSS_XS algorithm. But nevertheless, well done :-)

      Good question! In the benchmark cited, I used 100 1000-char strings (just because that was what the benchmark of another algorithm posted here used).
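For anyone wanting to build a comparable data set, a minimal sketch follows. It assumes random decimal-digit strings (the matches shown in this thread are all digits); the helper name `random_digit_strings` and the one-string-per-line layout are assumptions, since the actual .dat file format is not shown in the thread.

```perl
use strict;
use warnings;

# Return $count random strings, each $length decimal digits long.
sub random_digit_strings {
    my ($count, $length) = @_;
    return map {
        join '', map { int(rand(10)) } 1 .. $length
    } 1 .. $count;
}

# Example: 100 strings of 1000 characters, one per line (assumed layout).
# print "$_\n" for random_digit_strings(100, 1000);
```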

      In the following examples, LCSS10.pl is the script that uses String::LCSS_XS. The -MIN= parameter to the script simply stops it printing out any results less than MIN length. It doesn't stop it looking for and finding them.

      LCSSN.pl uses my pure-Perl (PP) implementation. Here, the -MIN= parameter doesn't stop it locating smaller matches. But it does stop it from considering them when looking for the best match, which helps performance a lot.
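The LCSSN.pl source is not shown in this thread, so the following is only a sketch of how a minimum length can prune the search in a pure-Perl longest-common-substring routine. The name `lcss_min` and the `index`-based strategy are assumptions for illustration, not the poster's algorithm. The idea is to start the "best so far" length at MIN - 1, so any candidate shorter than MIN is rejected by a single `index` probe.

```perl
use strict;
use warnings;

# Return the first longest common substring of at least $min characters,
# or '' if there is none. Hypothetical helper, not the LCSSN.pl code.
sub lcss_min {
    my ($short, $long, $min) = @_;
    ($short, $long) = ($long, $short) if length($short) > length($long);
    $min = 1 unless defined $min && $min > 0;

    # Treat MIN - 1 as already beaten, so shorter matches are never tried.
    my ($best, $best_len) = ('', $min - 1);

    for (my $i = 0; $i + $best_len < length($short); $i++) {
        # Only a candidate one character longer than the best is worth probing.
        my $len = $best_len + 1;
        next if index($long, substr($short, $i, $len)) < 0;

        # It beats the best; extend it as far as it still occurs in $long.
        while ($i + $len < length($short)
            && index($long, substr($short, $i, $len + 1)) >= 0) {
            $len++;
        }
        ($best, $best_len) = (substr($short, $i, $len), $len);
    }
    return $best;
}

print lcss_min('xxhelloyy', 'zzhellozz', 3), "\n";   # hello
```

In a script run with `perl -s`, a `-MIN=10` switch after the script name sets the package variable `$MIN`, which could be passed as the third argument here.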

      Increasing the length doesn't affect the performance differential much. If anything it seems to favour my implementation. The following is run on four 100,000-char strings:

      C:\test>perl -s LCSS10.pl -MIN=10 -- < 4x1e5.dat
      000001(3463) and 000002(91858): 10 '8890235173'
      000001(18712) and 000004(79151): 12 '260703543044'
      000002(39758) and 000003(4141): 10 '1595533057'
      000002(61266) and 000004(29466): 10 '6247963240'
      000003(45661) and 000004(32254): 12 '381074855852'
      Took: 170.777 seconds

      C:\test>perl -s LCSSN.pl -MIN=10 -- < 4x1e5.dat
      000001(3463) and 000002(91858): 10 '8890235173'
      000001(18712) and 000004(79151): 12 '260703543044'
      000002(39758) and 000003(4141): 10 '1595533057'
      000002(61266) and 000004(29466): 10 '6247963240'
      000003(45661) and 000004(32254): 12 '381074855852'
      Took: 110.078 seconds

      However, reducing -MIN to 2 does affect it, and shows String::LCSS_XS in a good light. The following are run on the same file:

      C:\test>perl -s LCSS10.pl -MIN=2 -- < 4x1e5.dat
      000001(3463) and 000002(91858): 10 '8890235173'
      000001(34923) and 000003(7672): 9 '826854356'
      000001(18712) and 000004(79151): 12 '260703543044'
      000002(39758) and 000003(4141): 10 '1595533057'
      000002(61266) and 000004(29466): 10 '6247963240'
      000003(45661) and 000004(32254): 12 '381074855852'
      Took: 171.108 seconds

      C:\test>perl -s LCSSN.pl -MIN=2 -- < 4x1e5.dat
      000001(3463) and 000002(91858): 10 '8890235173'
      000001(91904) and 000003(82429): 9 '784839043'
      000001(18712) and 000004(79151): 12 '260703543044'
      000002(39758) and 000003(4141): 10 '1595533057'
      000002(61266) and 000004(29466): 10 '6247963240'
      000003(45661) and 000004(32254): 12 '381074855852'
      Took: 1240.604 seconds

      All the extra locating and overwriting of the best-yet match hurts my algorithm badly by comparison. Presumably, if you added a MIN parameter to your code, it would benefit from it when used and would then beat mine blow for blow.

      That said, mine works best when there is a large length differential between the two strings being compared. It currently makes 2 passes of the longer string for each character in the shorter string. As both are the same length in these benchmarks, these tests are the worst case.

      I have a notion that, if coded in C, I can reduce that to 1 pass, which would redress the balance a little, but probably not enough to make it worthwhile for the cases where both strings are approximately the same length.


        Ah, nice, thanks. I'll add such a min option in the next release.

        Update: I've uploaded a new developer version to CPAN with this feature.
