Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??
Hi,

I have a matching problem that i would like your input on. It does not need to be code input but more of, how you would approach it. What I have is a set of paired strings such that sometimes a suffix of a string A is a prefix of a string B. When I say sometimes i mean that this is not always the case. Moreover, the suffix and the prefix need not to be perfect matches, that is, up to X number of mismatches are allowed. Therefore, in case a suffix and prefix do not match, the longest common substring is equal to X (however, such a match is of no value). To illustrate the problem better, consider the following example:

A: abbaba B: babbaaaa A: abbaba B: --babbaaaa
Given the X=1 suffix baba of A is the longest prefix of B. In order to find such matches i constructed a quite naive algorithm to solve it :
#!/usr/bin/perl use strict; srand(4); for (1..10){ &find_a_sufpref_match($_,&generate_random_string(4),&generate_random +_string(4),1); } sub find_a_sufpref_match { my ($id,$a,$b,$msm) = @_; my @a = split("",$a); my @b = split("",$b); my ($start,$length, $m) = (0,0,0); for (my $i = @a; $i>=0; $i--){ my ($x,$mx,$tl,$j) = ($i,0,0,0); while( $j <= @a-$i){ $mx++ if $a[$x] ne $b[$j]; $tl++; if ($mx > $msm){ if ($j < @a-$i){ $tl = 0; } $mx--; last; } $x++; $j++; } ($start,$length,$m) = ($i,$tl-1,$mx) if ($tl -1 > $length) } print "$id:($m)($start,$length)($a,$b)\n"; } sub generate_random_string{ my $length=shift;# the length of the random string to generate my @chars= qw(a b); my $random_string; foreach (1..$length){ $random_string.=$chars[rand @chars]; } return $random_string; }
Which outputs:
1:(1)(0,4)(bbab,bbbb) 2:(1)(0,4)(aaaa,aaba) 3:(1)(1,3)(baaa,abaa) 4:(1)(2,2)(aabb,bbab) 5:(1)(0,4)(abab,abaa) 6:(1)(0,4)(aabb,babb) 7:(1)(3,1)(aaaa,bbaa) 8:(1)(3,1)(baaa,bbba) 9:(1)(0,4)(aaab,baab) 10:(0)(0,4)(aaaa,aaaa)
Maybe the code has a bug in it, but currently i do not see it... The point is to start comparing strings inwards (increasing the size of a compared suffix/prefix match).Clearly this algorithm in O(|A|x|B|) and in case strings are random one could expect the runtime to be closer to linear since the X cutoff is expected to be reached frequently. In reality I am expecting to process 700-900 mil pairs each of size 20 - 2000 characters delivered in batches (files) of 10 mil pairs (that means i have a context of 10 mil pairs that i currently see no value in because the pairs are supposed to be unrelated, aside from the fact that i can process them in parallel).

Does anyone know a smarter (faster) way to solve this matching problem ?

What i also considered was KMP implementation, but i am not sure whether the preprocessing is worth the time and I am not quite sure how to go about the growing pattern length and allowed mismatches.

Anyways, thank you for any or none input provided :)

In reply to Suffix-prefix matching done right by baxy77bax

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others browsing the Monastery: (1)
As of 2024-04-24 17:00 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found