
Re: Search for identical substrings

by hv (Prior)
on Aug 19, 2005 at 01:44 UTC


in reply to Search for identical substrings

I recommend taking a look at the "longest common substring" section of this page on Dynamic Programming. These algorithms are pretty simple, and pretty fast.

Note that this type of iteration would benefit massively from converting to C - I'd recommend using Inline::C to convert just that one function.

(If time permits over the weekend I might have a go at that. But it'd be nice to have a decent data set to test it against.)

Hugo

Re^2: Search for identical substrings
by TilRMan (Friar) on Aug 21, 2005 at 05:57 UTC

    The longest common substring algorithm on that page runs in O(m * n) time but requires O(m * n) space as well. Contrast that with the naive solution, which takes O(m * m * n) time but only O(m + n) memory.
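    For comparison, here is a minimal sketch of what the naive solution might look like. This is an assumption about what "naive" means here (try every pair of start positions and extend the match); the function name naive_lcs and its parameters are hypothetical and do not come from the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not from the thread: brute-force longest common
 * substring. For every start position in a and every start position in b,
 * extend the match as far as it goes. Time is roughly m * n * min(m, n);
 * no table is allocated, so extra memory is constant.
 * Returns the length; *pos_a and *pos_b receive the 0-based start offsets
 * of the first longest match found (both 0 when nothing matches). */
size_t naive_lcs(const char *a, const char *b, size_t *pos_a, size_t *pos_b)
{
    assert(a != NULL && b != NULL && pos_a != NULL && pos_b != NULL);

    size_t m = strlen(a);
    size_t n = strlen(b);
    size_t best = 0;

    *pos_a = 0;
    *pos_b = 0;

    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            size_t k = 0;

            /* Count how many characters agree starting at a[i] and b[j]. */
            while (i + k < m && j + k < n && a[i + k] == b[j + k])
                k++;

            if (k > best) {
                best = k;
                *pos_a = i;
                *pos_b = j;
            }
        }
    }
    return best;
}
```

    Calling naive_lcs("hello", "aloha", &pa, &pb) with two size_t variables pa and pb returns 2 and sets pa to 3 and pb to 1, i.e. the "lo" in both strings.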

    That said, since the OP has strings of length 3000 characters, we're looking at only 3000 * 3000 * sizeof(uint16_t) = 18 megabytes of space. If the strings were, say, 100k each, we'd have problems.
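    At 100k characters each, the same table would need 100,000 * 100,000 * 2 bytes, about 20 gigabytes. One common way around that, which is not what the page or this thread uses, is to keep only two rows of the table at a time, since each cell depends only on the previous row. A rough sketch, with lcs_two_rows and its parameters being hypothetical names:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical variant, not from the thread: longest common substring
 * with two rolling rows instead of the full (m+1) x (n+1) table.
 * Time is still m * n, but memory drops to 2 * (n + 1) counters.
 * Returns the length, or (size_t)-1 if allocation fails; *start_a and
 * *start_b receive the 0-based start offsets of the match. */
size_t lcs_two_rows(const char *a, const char *b,
                    size_t *start_a, size_t *start_b)
{
    assert(a != NULL && b != NULL && start_a != NULL && start_b != NULL);

    size_t m = strlen(a);
    size_t n = strlen(b);
    size_t best = 0;

    *start_a = 0;
    *start_b = 0;

    /* calloc zero-fills, so column 0 of both rows stays 0 throughout. */
    size_t *prev = calloc(n + 1, sizeof *prev);
    size_t *curr = calloc(n + 1, sizeof *curr);
    if (prev == NULL || curr == NULL) {
        free(prev);
        free(curr);
        return (size_t)-1;
    }

    for (size_t i = 1; i <= m; i++) {
        for (size_t j = 1; j <= n; j++) {
            if (a[i - 1] == b[j - 1]) {
                /* Extend the match that ended at (i-1, j-1). */
                curr[j] = prev[j - 1] + 1;
                if (curr[j] > best) {
                    best = curr[j];
                    *start_a = i - best;
                    *start_b = j - best;
                }
            } else {
                curr[j] = 0;
            }
        }
        /* The row just filled becomes the previous row for the next i. */
        size_t *tmp = prev;
        prev = curr;
        curr = tmp;
    }

    free(prev);
    free(curr);
    return best;
}
```

    With 8-byte counters, two rows for a 100k-character string come to well under 2 megabytes. The trade-off is that the full table is gone afterwards, so only the best match seen so far is kept.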

    So, the name of this site notwithstanding, here's some C code, poorly tested

    Maybe this would be a good time for me to learn Inline::C.

      Here's my Inline C implementation.

      #! perl -slw
      use strict;
      #use Inline 'INFO';
      use Inline C => 'DATA', NAME => 'LCS', CLEAN_AFTER_BUILD => 1;

      my( $len, $offset0, $offset1 ) = LCS( @ARGV );

      # Mark the common substring in each argument with <...>
      $ARGV[ 0 ] =~ s[(.{$offset0})(.{$len})][$1<$2>];
      $ARGV[ 1 ] =~ s[(.{$offset1})(.{$len})][$1<$2>];

      print for @ARGV;

      __END__
      [ 9:10:28.57] P:\test>DynLCS-C hello aloha
      hel<lo>
      a<lo>ha

      __C__

      /* The table has (an+1) columns and (bn+1) rows; x is the 1-based
         position in a, y the 1-based position in b. */
      #define IDX( x, y ) (((y) * (an + 1)) + (x))

      /*
          LONGEST COMMON SUBSTRING(A,m,B,n)
              for i := 0 to m do L[i,0] := 0
              for j := 0 to n do L[0,j] := 0
              len := 0
              answer := <0,0>
              for i := 1 to m do
                  for j := 1 to n do
                      if A[i] != B[j] then
                          L[i,j] := 0
                      else
                          L[i,j] := 1 + L[i-1,j-1]
                          if L[i,j] > len then
                              len := L[i,j]
                              answer = <i,j>
      */

      void LCS ( char* a, char* b ) {
          Inline_Stack_Vars;
          int an = strlen( a );
          int bn = strlen( b );
          int *L;
          int len = 0;
          int answer[2] = { 0, 0 };
          int i, j;

          /* Newz zero-fills, so row 0 and column 0 are already 0 */
          Newz( 42, L, ( an + 1 ) * ( bn + 1 ), int );

          /* i and j are 1-based as in the pseudocode; the C strings are
             0-based, hence a[ i - 1 ] and b[ j - 1 ] */
          for( i = 1; i <= an; i++ ) {
              for( j = 1; j <= bn; j++ ) {
                  if( a[ i - 1 ] != b[ j - 1 ] ) {
                      L[ IDX(i,j) ] = 0;
                  }
                  else {
                      L[ IDX(i,j) ] = 1 + L[ IDX(i-1, j-1) ];
                      if( L[ IDX(i,j) ] > len ) { // xs(70)
                          len = L[ IDX(i,j) ];
                          answer[ 0 ] = i;
                          answer[ 1 ] = j;
                      }
                  }
              }
          }

          Safefree( L );

          /* Return ( length, 0-based start in a, 0-based start in b ) */
          Inline_Stack_Reset;
          Inline_Stack_Push(sv_2mortal(newSViv( len )));
          Inline_Stack_Push(sv_2mortal(newSViv( answer[ 0 ] - len )));
          Inline_Stack_Push(sv_2mortal(newSViv( answer[ 1 ] - len )));
          Inline_Stack_Done;
      }

