Re: Search for identical substrings

I think the code you posted doesn't make a lot of sense, given the problem that you described at the outset. If you're looking for the longest common substring in two strings that are each 3k characters long, and you're doing this across all distinct pairings of 300 such strings, you probably want to look at Algorithm::Diff and its "LCS" function.

I haven't used it much myself; looking at the man page, it seems like you might need to split each string into an array of characters -- something like this (not tested):

#!/usr/bin/perl

use strict;
use Algorithm::Diff qw/LCS/;

my @strings;

# fill @strings with your 300 elements of 3k characters each, then ...

for my $i ( 0 .. $#strings-1 ) {
    my @seq1 = split //, $strings[$i];
    for my $j ( $i+1 .. $#strings ) {
        my @seq2 = split //, $strings[$j];
        my @lcs = LCS( \@seq1, \@seq2 );
        print "LCS for $i :: $j is ", join( "",@lcs,"\n" );
    }
}
[download]

I'm just guessing about that. Do heed Grandfather's request, and show us some data (and maybe some other code you've tried that really is more relevant).

(updated code to fix a variable-name typo and the comment, and to declare @strings; also added spacing in the print statement, to avoid misuse of "::" as a namespace designation. Tested it on a sample text file (380 lines but less than 100 chars/line), and it seemed to do what's desired** -- not tremendously fast, but not impossibly slow.)

(** second update: ** then again, looking at the output, it's not clear that my script is actually making correct use of the module -- I don't seem to be getting a single contiguous "LCS" string from each comparison as I was expecting, but rather a bunch of fragments that are each one or more characters long. Better read further in the man page...)

Comment on Re: Search for identical substrings Download Code

Replies are listed 'Best First'.
Re^2: Search for identical substrings by bioMan (Beadle) on Aug 18, 2005 at 16:31 UTC
I cobbled the code together by breaking my problem into discrete tasks. I then started writing code to accomplish each task and testing the code to ensure it worked. As I got each task working I added code to accomplish the next task. I admit it only makes sense to me because I know what the original problem was and the steps I took to solve each task. It aint pretty. I have been writing small perl scripts for more than a year but I still don't consider myself the be a perl programmer. I need to work through all the thoughtful suggestions I have received, and I thank everyone for their help.	[reply]
Re^2: Search for identical substrings by bioMan (Beadle) on Aug 18, 2005 at 17:00 UTC
Ultimately all my code revolves around the regex `/(.{$min,}).\1/gi` I realize that the regex is the rate limiting step in this process. Of course $min is necessary because allowing regex to look for all* matches is prohibitive. Being ignorant of the subtleties of perl I don't know if there are other ways to effectively match substrings.	[reply] [d/l]


The stupid question is the question not asked
	PerlMonks