http://qs321.pair.com?node_id=11138432

baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have a matching problem that i would like your input on. It does not need to be code input but more of, how you would approach it. What I have is a set of paired strings such that sometimes a suffix of a string A is a prefix of a string B. When I say sometimes i mean that this is not always the case. Moreover, the suffix and the prefix need not to be perfect matches, that is, up to X number of mismatches are allowed. Therefore, in case a suffix and prefix do not match, the longest common substring is equal to X (however, such a match is of no value). To illustrate the problem better, consider the following example:

A: abbaba B: babbaaaa A: abbaba B: --babbaaaa
Given the X=1 suffix baba of A is the longest prefix of B. In order to find such matches i constructed a quite naive algorithm to solve it :
#!/usr/bin/perl use strict; srand(4); for (1..10){ &find_a_sufpref_match($_,&generate_random_string(4),&generate_random +_string(4),1); } sub find_a_sufpref_match { my ($id,$a,$b,$msm) = @_; my @a = split("",$a); my @b = split("",$b); my ($start,$length, $m) = (0,0,0); for (my $i = @a; $i>=0; $i--){ my ($x,$mx,$tl,$j) = ($i,0,0,0); while( $j <= @a-$i){ $mx++ if $a[$x] ne $b[$j]; $tl++; if ($mx > $msm){ if ($j < @a-$i){ $tl = 0; } $mx--; last; } $x++; $j++; } ($start,$length,$m) = ($i,$tl-1,$mx) if ($tl -1 > $length) } print "$id:($m)($start,$length)($a,$b)\n"; } sub generate_random_string{ my $length=shift;# the length of the random string to generate my @chars= qw(a b); my $random_string; foreach (1..$length){ $random_string.=$chars[rand @chars]; } return $random_string; }
Which outputs:
1:(1)(0,4)(bbab,bbbb) 2:(1)(0,4)(aaaa,aaba) 3:(1)(1,3)(baaa,abaa) 4:(1)(2,2)(aabb,bbab) 5:(1)(0,4)(abab,abaa) 6:(1)(0,4)(aabb,babb) 7:(1)(3,1)(aaaa,bbaa) 8:(1)(3,1)(baaa,bbba) 9:(1)(0,4)(aaab,baab) 10:(0)(0,4)(aaaa,aaaa)
Maybe the code has a bug in it, but currently i do not see it... The point is to start comparing strings inwards (increasing the size of a compared suffix/prefix match).Clearly this algorithm in O(|A|x|B|) and in case strings are random one could expect the runtime to be closer to linear since the X cutoff is expected to be reached frequently. In reality I am expecting to process 700-900 mil pairs each of size 20 - 2000 characters delivered in batches (files) of 10 mil pairs (that means i have a context of 10 mil pairs that i currently see no value in because the pairs are supposed to be unrelated, aside from the fact that i can process them in parallel).

Does anyone know a smarter (faster) way to solve this matching problem ?

What i also considered was KMP implementation, but i am not sure whether the preprocessing is worth the time and I am not quite sure how to go about the growing pattern length and allowed mismatches.

Anyways, thank you for any or none input provided :)