comment on

Hi,

I have a matching problem that i would like your input on. It does not need to be code input but more of, how you would approach it. What I have is a set of paired strings such that sometimes a suffix of a string A is a prefix of a string B. When I say sometimes i mean that this is not always the case. Moreover, the suffix and the prefix need not to be perfect matches, that is, up to X number of mismatches are allowed. Therefore, in case a suffix and prefix do not match, the longest common substring is equal to X (however, such a match is of no value). To illustrate the problem better, consider the following example:

A: abbaba
B: babbaaaa

A: abbaba
B: --babbaaaa
[download]

Given the X=1 suffix baba of A is the longest prefix of B. In order to find such matches i constructed a quite naive algorithm to solve it :

#!/usr/bin/perl

use strict;
srand(4);

for (1..10){
  &find_a_sufpref_match($_,&generate_random_string(4),&generate_random
+_string(4),1);
}


sub find_a_sufpref_match {
  my ($id,$a,$b,$msm) = @_;

  my @a = split("",$a);
  my @b = split("",$b);
  my ($start,$length, $m) = (0,0,0);

  for (my $i = @a; $i>=0; $i--){
     my ($x,$mx,$tl,$j) = ($i,0,0,0);
      while( $j <= @a-$i){
        $mx++ if $a[$x] ne $b[$j];
        $tl++;
        if ($mx > $msm){
          if ($j < @a-$i){
            $tl = 0;
          }
          $mx--;
          last;
        }
        $x++;
        $j++;
      }
      ($start,$length,$m) = ($i,$tl-1,$mx) if ($tl -1 > $length)
  }
  print "$id:($m)($start,$length)($a,$b)\n";
}




sub generate_random_string{
  my $length=shift;# the length of the random string to generate

  my @chars= qw(a b);
  my $random_string;
  foreach (1..$length){
    $random_string.=$chars[rand @chars];
  }
  return $random_string;
}
[download]

Which outputs:

1:(1)(0,4)(bbab,bbbb)
2:(1)(0,4)(aaaa,aaba)
3:(1)(1,3)(baaa,abaa)
4:(1)(2,2)(aabb,bbab)
5:(1)(0,4)(abab,abaa)
6:(1)(0,4)(aabb,babb)
7:(1)(3,1)(aaaa,bbaa)
8:(1)(3,1)(baaa,bbba)
9:(1)(0,4)(aaab,baab)
10:(0)(0,4)(aaaa,aaaa)
[download]

Maybe the code has a bug in it, but currently i do not see it... The point is to start comparing strings inwards (increasing the size of a compared suffix/prefix match).Clearly this algorithm in O(|A|x|B|) and in case strings are random one could expect the runtime to be closer to linear since the X cutoff is expected to be reached frequently. In reality I am expecting to process 700-900 mil pairs each of size 20 - 2000 characters delivered in batches (files) of 10 mil pairs (that means i have a context of 10 mil pairs that i currently see no value in because the pairs are supposed to be unrelated, aside from the fact that i can process them in parallel).

Does anyone know a smarter (faster) way to solve this matching problem ?

What i also considered was KMP implementation, but i am not sure whether the preprocessing is worth the time and I am not quite sure how to go about the growing pattern length and allowed mismatches.

Anyways, thank you for any or none input provided :)

In reply to Suffix-prefix matching done right by baxy77bax

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


laziness, impatience, and hubris
	PerlMonks