Suffix-prefix matching done right

baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have a matching problem that i would like your input on. It does not need to be code input but more of, how you would approach it. What I have is a set of paired strings such that sometimes a suffix of a string A is a prefix of a string B. When I say sometimes i mean that this is not always the case. Moreover, the suffix and the prefix need not to be perfect matches, that is, up to X number of mismatches are allowed. Therefore, in case a suffix and prefix do not match, the longest common substring is equal to X (however, such a match is of no value). To illustrate the problem better, consider the following example:

A: abbaba
B: babbaaaa

A: abbaba
B: --babbaaaa
[download]

Given the X=1 suffix baba of A is the longest prefix of B. In order to find such matches i constructed a quite naive algorithm to solve it :

#!/usr/bin/perl

use strict;
srand(4);

for (1..10){
  &find_a_sufpref_match($_,&generate_random_string(4),&generate_random
+_string(4),1);
}


sub find_a_sufpref_match {
  my ($id,$a,$b,$msm) = @_;

  my @a = split("",$a);
  my @b = split("",$b);
  my ($start,$length, $m) = (0,0,0);

  for (my $i = @a; $i>=0; $i--){
     my ($x,$mx,$tl,$j) = ($i,0,0,0);
      while( $j <= @a-$i){
        $mx++ if $a[$x] ne $b[$j];
        $tl++;
        if ($mx > $msm){
          if ($j < @a-$i){
            $tl = 0;
          }
          $mx--;
          last;
        }
        $x++;
        $j++;
      }
      ($start,$length,$m) = ($i,$tl-1,$mx) if ($tl -1 > $length)
  }
  print "$id:($m)($start,$length)($a,$b)\n";
}




sub generate_random_string{
  my $length=shift;# the length of the random string to generate

  my @chars= qw(a b);
  my $random_string;
  foreach (1..$length){
    $random_string.=$chars[rand @chars];
  }
  return $random_string;
}
[download]

Which outputs:

1:(1)(0,4)(bbab,bbbb)
2:(1)(0,4)(aaaa,aaba)
3:(1)(1,3)(baaa,abaa)
4:(1)(2,2)(aabb,bbab)
5:(1)(0,4)(abab,abaa)
6:(1)(0,4)(aabb,babb)
7:(1)(3,1)(aaaa,bbaa)
8:(1)(3,1)(baaa,bbba)
9:(1)(0,4)(aaab,baab)
10:(0)(0,4)(aaaa,aaaa)
[download]

Maybe the code has a bug in it, but currently i do not see it... The point is to start comparing strings inwards (increasing the size of a compared suffix/prefix match).Clearly this algorithm in O(|A|x|B|) and in case strings are random one could expect the runtime to be closer to linear since the X cutoff is expected to be reached frequently. In reality I am expecting to process 700-900 mil pairs each of size 20 - 2000 characters delivered in batches (files) of 10 mil pairs (that means i have a context of 10 mil pairs that i currently see no value in because the pairs are supposed to be unrelated, aside from the fact that i can process them in parallel).

Does anyone know a smarter (faster) way to solve this matching problem ?

What i also considered was KMP implementation, but i am not sure whether the preprocessing is worth the time and I am not quite sure how to go about the growing pattern length and allowed mismatches.

Anyways, thank you for any or none input provided :)

Back to Seekers of Perl Wisdom