Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

The longest common substring algorithm on that page is (m * n) time but requires (m * n) space as well. Contrast to the naive solution, which is (m * m * n) time but (m + n) memory.

That said, since the OP has strings of length 3000 characters, we're looking at only 3000 * 3000 * sizeof(uint16_t) = 18 megabytes of space. If the strings were, say, 100k each, we'd have problems.

So, the name of this site notwithstanding, here's some C code, poorly tested

/* Longest Common Substring * * C Implementation of the algorithm found at: * http://www.ics.uci.edu/~dan/class/161/notes/6/Dynamic.html */ /* This section is reproduced from the URL above. * * California law forbids anyone from making use of class * lecture notes for commercial purposes, and such activity * is expressly forbidden without the instructor's written * permission. * Longest common substring problem Given two sequences of letters, such as A = HELLO and B = ALOHA, find the longest contiguous sequence appearing in both. One solution: (assume strings have lengths m and n) For each of the m starting points of A, check for the longest common string starting at each of the n starting points of B. The checks could average (m) time a total of (m2n) time. Dynamic programming solution: Let Li, j = maximum length of common strings that end at A[i] & B[j]. Then, A[i] = B[j] Li, j = 1 + Li-1, j-1 A[i] B[j] Li, j = 0 LONGEST COMMON SUBSTRING(A,m,B,n) for i := 0 to m do Li,0 := 0 for j := 0 to n do L0,j := 0 len := 0 answer := <0,0> for i := 1 to m do for j := 1 to n do if Ai Bj then Li,j := 0 else Li,j := 1 + Li-1,j-1 if Li,j > len then len := Li,j answer = <i,j> Example: A L O H A H 0 0 0 1 0 E 0 0 0 0 0 L 0 1 0 0 0 L 0 1 0 0 0 O 0 0 2 0 0 * */ /* A type big enough to hold the string length. */ typedef long LCS_t; /* A type that can hold any element of the string. */ typedef char LCS_c; /* A should be a pointer to the beginning of the string. * m should be the index of the last character in the * string, plus one. * Likewise for B and n. * scratch must be an array of at least size m * n. */ LCS_t longest_common_substring ( LCS_c *A, LCS_t m, LCS_c *B, LCS_t n, LCS_t *scratch, LCS_t *answer_x, LCS_t *answer_y ) { LCS_t i, j; LCS_t len = 0; *answer_x = *answer_y = 0; #define L(x, y) scratch[(x) * n + (y)] for (i = 0; i < m; i++) L(i, 0) = 0; for (j = 0; j < n; j++) L(0, j) = 0; for (i = 0; i < m; i++) { for (j = 0; j < n; j++) { if (A[i] != B[j]) { L(i, j) = 0; } else { L(i, j) = 1 + L(i - 1, j - 1); if (L(i, j) > len) { len = L(i, j); *answer_x = i; *answer_y = j; } } } } #undef L return len; } #include <malloc.h> #include <stdio.h> #include <string.h> int main(int argc, char **argv) { if (argc != 3) return 2; char *A = argv[1]; char *B = argv[2]; long m = strlen(A); long n = strlen(B); long *scratch = (long *) malloc(m * n * sizeof(long)); if (!scratch) return 1; long answer_x, answer_y; long len; len = longest_common_substring(A, m, B, n, scratch, &answer_x, &answer_y); free(scratch); printf("LCS is %ld chars ending at <%ld,%ld>.\n", len, answer_x, answer_y); return 0; }

Maybe this would be a good time for me to learn Inline::C.


In reply to Re^2: Search for identical substrings by TilRMan
in thread Search for identical substrings by bioMan

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others about the Monastery: (4)
As of 2024-03-29 09:31 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found