note
TilRMan
<p>The longest common substring algorithm on that page is (m * n) time but requires (m * n) space as well. Contrast to the naive solution, which is (m * m * n) time but (m + n) memory.
<p>That said, since the OP has strings of length 3000 characters, we're looking at only 3000 * 3000 * sizeof(uint16_t) = 18 megabytes of space. If the strings were, say, 100k each, we'd have problems.
<p>So, the name of this site notwithstanding, here's some C code, poorly tested <readmore>
<code>
/* Longest Common Substring
*
* C Implementation of the algorithm found at:
* http://www.ics.uci.edu/~dan/class/161/notes/6/Dynamic.html
*/
/* This section is reproduced from the URL above.
*
* California law forbids anyone from making use of class
* lecture notes for commercial purposes, and such activity
* is expressly forbidden without the instructor's written
* permission.
*
Longest common substring problem
Given two sequences of letters, such as A = HELLO and
B = ALOHA, find the longest contiguous sequence appearing
in both.
One solution: (assume strings have lengths m and n)
For each of the m starting points of A, check for the
longest common string starting at each of the n starting
points of B.
The checks could average (m) time a total of (m2n) time.
Dynamic programming solution:
Let Li, j = maximum length of common strings that end at
A[i] & B[j].
Then,
A[i] = B[j] Li, j = 1 + Li-1, j-1
A[i] B[j] Li, j = 0
LONGEST COMMON SUBSTRING(A,m,B,n)
for i := 0 to m do Li,0 := 0
for j := 0 to n do L0,j := 0
len := 0
answer := <0,0>
for i := 1 to m do
for j := 1 to n do
if Ai Bj then
Li,j := 0
else
Li,j := 1 + Li-1,j-1
if Li,j > len then
len := Li,j
answer = <i,j>
Example:
A L O H A
H 0 0 0 1 0
E 0 0 0 0 0
L 0 1 0 0 0
L 0 1 0 0 0
O 0 0 2 0 0
*
*/
/* A type big enough to hold the string length. */
typedef long LCS_t;
/* A type that can hold any element of the string. */
typedef char LCS_c;
/* A should be a pointer to the beginning of the string.
* m should be the index of the last character in the
* string, plus one.
* Likewise for B and n.
* scratch must be an array of at least size m * n.
*/
LCS_t longest_common_substring
( LCS_c *A, LCS_t m,
LCS_c *B, LCS_t n,
LCS_t *scratch,
LCS_t *answer_x, LCS_t *answer_y
)
{
LCS_t i, j;
LCS_t len = 0;
*answer_x = *answer_y = 0;
#define L(x, y) scratch[(x) * n + (y)]
for (i = 0; i < m; i++) L(i, 0) = 0;
for (j = 0; j < n; j++) L(0, j) = 0;
for (i = 0; i < m; i++)
{
for (j = 0; j < n; j++)
{
if (A[i] != B[j])
{
L(i, j) = 0;
}
else
{
L(i, j) = 1 + L(i - 1, j - 1);
if (L(i, j) > len)
{
len = L(i, j);
*answer_x = i;
*answer_y = j;
}
}
}
}
#undef L
return len;
}
#include <malloc.h>
#include <stdio.h>
#include <string.h>
int main(int argc, char **argv)
{
if (argc != 3) return 2;
char *A = argv[1];
char *B = argv[2];
long m = strlen(A);
long n = strlen(B);
long *scratch = (long *) malloc(m * n * sizeof(long));
if (!scratch) return 1;
long answer_x, answer_y;
long len;
len = longest_common_substring(A, m, B, n, scratch,
&answer_x, &answer_y);
free(scratch);
printf("LCS is %ld chars ending at <%ld,%ld>.\n",
len, answer_x, answer_y);
return 0;
}
</code>
</readmore>
<p>Maybe this would be a good time for me to learn Inline::C.
484593
485000