in reply to Re: Search for identical substrings

in thread Search for identical substrings

The longest common substring algorithm on that page is (m * n) time but requires (m * n) space as well. Contrast to the naive solution, which is (m * m * n) time but (m + n) memory.

That said, since the OP has strings of length 3000 characters, we're looking at only 3000 * 3000 * sizeof(uint16_t) = 18 megabytes of space. If the strings were, say, 100k each, we'd have problems.

So, the name of this site notwithstanding, here's some C code, poorly tested

/* Longest Common Substring * * C Implementation of the algorithm found at: * http://www.ics.uci.edu/~dan/class/161/notes/6/Dynamic.html */ /* This section is reproduced from the URL above. * * California law forbids anyone from making use of class * lecture notes for commercial purposes, and such activity * is expressly forbidden without the instructor's written * permission. * Longest common substring problem Given two sequences of letters, such as A = HELLO and B = ALOHA, find the longest contiguous sequence appearing in both. One solution: (assume strings have lengths m and n) For each of the m starting points of A, check for the longest common string starting at each of the n starting points of B. The checks could average (m) time a total of (m2n) time. Dynamic programming solution: Let Li, j = maximum length of common strings that end at A[i] & B[j]. Then, A[i] = B[j] Li, j = 1 + Li-1, j-1 A[i] B[j] Li, j = 0 LONGEST COMMON SUBSTRING(A,m,B,n) for i := 0 to m do Li,0 := 0 for j := 0 to n do L0,j := 0 len := 0 answer := <0,0> for i := 1 to m do for j := 1 to n do if Ai Bj then Li,j := 0 else Li,j := 1 + Li-1,j-1 if Li,j > len then len := Li,j answer = <i,j> Example: A L O H A H 0 0 0 1 0 E 0 0 0 0 0 L 0 1 0 0 0 L 0 1 0 0 0 O 0 0 2 0 0 * */ /* A type big enough to hold the string length. */ typedef long LCS_t; /* A type that can hold any element of the string. */ typedef char LCS_c; /* A should be a pointer to the beginning of the string. * m should be the index of the last character in the * string, plus one. * Likewise for B and n. * scratch must be an array of at least size m * n. */ LCS_t longest_common_substring ( LCS_c *A, LCS_t m, LCS_c *B, LCS_t n, LCS_t *scratch, LCS_t *answer_x, LCS_t *answer_y ) { LCS_t i, j; LCS_t len = 0; *answer_x = *answer_y = 0; #define L(x, y) scratch[(x) * n + (y)] for (i = 0; i < m; i++) L(i, 0) = 0; for (j = 0; j < n; j++) L(0, j) = 0; for (i = 0; i < m; i++) { for (j = 0; j < n; j++) { if (A[i] != B[j]) { L(i, j) = 0; } else { L(i, j) = 1 + L(i - 1, j - 1); if (L(i, j) > len) { len = L(i, j); *answer_x = i; *answer_y = j; } } } } #undef L return len; } #include <malloc.h> #include <stdio.h> #include <string.h> int main(int argc, char **argv) { if (argc != 3) return 2; char *A = argv[1]; char *B = argv[2]; long m = strlen(A); long n = strlen(B); long *scratch = (long *) malloc(m * n * sizeof(long)); if (!scratch) return 1; long answer_x, answer_y; long len; len = longest_common_substring(A, m, B, n, scratch, &answer_x, &answer_y); free(scratch); printf("LCS is %ld chars ending at <%ld,%ld>.\n", len, answer_x, answer_y); return 0; }

Maybe this would be a good time for me to learn Inline::C.

Replies are listed 'Best First'. | |
---|---|

Re^3: Search for identical substrings
by BrowserUk (Patriarch) on Aug 21, 2005 at 08:12 UTC |

In Section
Seekers of Perl Wisdom

Comment onRe^2: Search for identical substringsDownloadCode