http://qs321.pair.com?node_id=486405


in reply to Re^3: Fast common substring matching
in thread Fast common substring matching

The downside is that it is not only a single repeated character ('AAAA') but also short repeating sequences ('ACTACTACT') that can be missed or truncated. The upside is that for bioMan's problem a minimum match quantum of 128 is probably optimum and I'd guess that that is long enough to be unlikely to be a problem.

At this time I've not thought of a fast way of dealing with the issue and am somewhat inclined to ignore it unless someone can convince me that this is really useful code that needs this bug fixed.


Perl is Huffman encoded by design.

Re^5: Fast common substring matching
by BrowserUk (Patriarch) on Aug 25, 2005 at 02:23 UTC
    ... for bioMan's problem a minimum match quantum of 128 is probably optimum and I'd guess that that is long enough to be unlikely to be a problem.

    Seems to be. I scanned for repeating sequences of 2, 3 and 4 characters and none was longer than 50 chars, so a minimum quantum of 64 would probably also be possible.
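
The scan itself is not shown in the post. As a rough sketch of how such a check might be done in Perl (the sub name longest_repeat_run and the sample sequence are made up for illustration, and this is not necessarily how the figure of 50 was obtained), a back-reference inside a lookahead can measure the longest run for each unit length:

```perl
use strict;
use warnings;

# Return the length of the longest run built from a repeating unit of
# $min_unit .. $max_unit characters (e.g. 'ACTACTACT' is a run of 9).
# Hypothetical helper, not taken from the thread.
sub longest_repeat_run {
    my ($seq, $min_unit, $max_unit) = @_;
    my $longest = 0;
    for my $n ($min_unit .. $max_unit) {
        # The lookahead makes each match zero-length, so the regex is
        # retried at every offset and overlapping runs are all examined.
        while ($seq =~ /(?=((.{$n})\2+))/g) {
            $longest = length($1) if length($1) > $longest;
        }
    }
    return $longest;
}

# Made-up sample data: a run of four 'ACT' units inside other bases.
print longest_repeat_run('GGACTACTACTACTTTGCA', 2, 4), "\n";
```

If the longest run reported for the real data stays well below the chosen quantum, a block of that size cannot consist entirely of one short repeat.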

    inclined to ignore it unless someone can convince me that this is really useful

    I understand that totally. I ended up resorting to Inline C to get speed because every attempt to improve the performance of my Perl versions ended up missing things.

    Shame though. Your technique is so very fast for a pure Perl solution that it would be a real coup if it could be generalised.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.

      I doubt that even a pure C solution using this technique would be much faster. I haven't profiled it, but I'd guess most of the time is spent in index, and that is likely pretty efficient anyway.

      I think some fussy code could handle the special case without impacting performance too much. The key would be detecting that a search sub-pattern was a repeating pattern and then "drifting" the pattern left by the repeat length to see if there is an earlier match against the target string than was found by index. Maybe I need to write some code so you see what I mean? :)
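
To make the idea a little more concrete, here is a minimal sketch of one reading of it, not code from the actual matcher. The names repeat_length and drift_left are hypothetical, and the scenario (index being called with a start offset that lands inside a run of repeats) is an assumption about how the earlier match gets missed:

```perl
use strict;
use warnings;

# Smallest period of $pattern, or 0 if the pattern does not repeat at
# least twice. 'ACTACTAC' has period 3 even though the last unit is cut.
sub repeat_length {
    my ($pattern) = @_;
    my $len = length $pattern;
    for my $p (1 .. int($len / 2)) {
        return $p if substr($pattern, $p) eq substr($pattern, 0, $len - $p);
    }
    return 0;
}

# Hypothetical helper: starting from a hit at $pos, step left by the
# repeat length for as long as the pattern still matches the target.
sub drift_left {
    my ($target, $pattern, $pos, $period) = @_;
    return $pos unless $period > 0;
    my $len = length $pattern;
    while ($pos - $period >= 0
        && substr($target, $pos - $period, $len) eq $pattern) {
        $pos -= $period;
    }
    return $pos;
}

# Made-up data: the search is assumed to start at offset 6, mid-run.
my $target  = 'xxACTACTACTACTyy';
my $pattern = 'ACTACT';
my $hit     = index($target, $pattern, 6);
my $period  = repeat_length($pattern);
my $start   = drift_left($target, $pattern, $hit, $period);
print "index found $hit, drifted back to $start\n";
```

The period check is only worth running on blocks that actually repeat, so non-repeating blocks pay for one call to repeat_length and nothing more.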


      Perl is Huffman encoded by design.

        I am presently running my complete dataset with your program. The program has been merrily churning away for about 48 hours. When it completes this task I'll let you know how things turned out.

        update

        Oops, I had to restart the run. When I set up the program I added specific code to hardwire the name of my data file into the program. When I did this I created a bug, which caused the program to idle. I had not removed the $_ = <>; line, so the program was waiting for me to enter data from the keyboard.

        I commented out the if (@ARGV != 1){...} block and added the following:

        my $file = "mydata.txt";
        open FILE, '<', $file or die "Can't open $file: $!\n";
        my $out = "outdata.txt";
        open OUT, '>', $out or die "Can't open $out: $!\n";
        # all print and printf statements now print to this file handle

        # Read in the strings
        chomp(my @file = <FILE>);

        # declare variables
        my @strings = ();
        my $place = 1;
        my $strName = '';   # necessary for resolution of an undeclared
                            # global variable warning

        for (@file){
            if ($place){
                $strName = $_;                  # seq ID
                $place = 0;
            }else{
                push @strings, [$strName, $_];  # push seq ID, seq
                $place = 1;
            }
        }