http://qs321.pair.com?node_id=487102


in reply to Fast common substring matching

The latest revisions to your code have fixed most of the problems I identified earlier, but running it over my test suite did turn up this anomoly:

P:\test>GF n2l3172.dat 000:001 L[ 12] ( 364 23) 000:001 L[ 12] ( 364 23)* 000:001 L[ 12] ( 364 23)* 000:001 L[ 12] ( 364 23)* 000:001 L[ 12] (1102 1008) 000:001 L[ 12] (1124 2290) 000:001 L[ 12] (1506 2529) 000:001 L[ 12] (1506 2529)* 000:001 L[ 12] (1506 2529)* 000:001 L[ 12] (1506 2529)* 000:001 L[ 12] (1859 224) 000:001 L[ 12] (2046 2360) 000:001 L[ 12] (2399 1226) 000:001 L[ 12] (2480 401) 000:001 L[ 12] (2494 540) 000:001 L[ 12] (3023 1102) Completed in 5.292793 Best match: >1 - >2. 12 characters starting at 364 and 23. Best match: >1 - >2. 12 characters starting at 364 and 23.* Best match: >1 - >2. 12 characters starting at 364 and 23.* Best match: >1 - >2. 12 characters starting at 364 and 23.* Best match: >1 - >2. 12 characters starting at 1102 and 1008. Best match: >1 - >2. 12 characters starting at 1124 and 2290. Best match: >1 - >2. 12 characters starting at 1506 and 2529. Best match: >1 - >2. 12 characters starting at 1506 and 2529.* Best match: >1 - >2. 12 characters starting at 1506 and 2529.* Best match: >1 - >2. 12 characters starting at 1506 and 2529.* Best match: >1 - >2. 12 characters starting at 1859 and 224. Best match: >1 - >2. 12 characters starting at 2046 and 2360. Best match: >1 - >2. 12 characters starting at 2399 and 1226. Best match: >1 - >2. 12 characters starting at 2480 and 401. Best match: >1 - >2. 12 characters starting at 2494 and 540. Best match: >1 - >2. 12 characters starting at 3023 and 1102.

As you can see, the matches I've marked with * are duplicates. Here's the output from my program:

P:\test>484593-5 n2l3172.dat 000:001 L[012] ( 364, 23)'CAGGAGCGGGCG' (1102,1008)'ACAGCCACGAAA' (1124,2290)'AGCAGGAAGGAG' (1506,2529)'GCGCGAAGCAAA' (1859, 224)'AACAGGAAGGGA' (2046,2360)'CAGAGCCCACAG' (2399,1226)'GGCGCAACCCGG' (2480, 401)'GGGGCCGGCAAC' (2494, 540)'CGCAGCAGAAAC' (3023,1102)'GGCAGCAAGGGC' 1 trial of n2l3172.dat (132.223ms total), 132.223ms/trial

As the dataset lines are 3k in length, I've posted it on my pad rather than here.

Sorry for the length of the lines, but they are randomly generated and I haven't yet worked out what it is about these two lines that induces the anomolous behaviour?

Update: Okay. I managed to find a shorter testcase that demonstrates the duplicate matches problem. Again, I've marked the duplicates with *:

P:\test\LCS>type temp >string 1 AACCAGGGTCATAGGAGCCGGCACTATGATCTCTAGTGTTTTAGGAGTAGTGAACCACCTTCCAGACACG +GCAGCCAACATTCCGTCTATCGCCTACTCG >string 2 TAGCACGGTGATAAGCTACAATTAACGCCATCTCACGCGCGCGAGTGTGTTCTACTACCCATTATATAAG +ATGAAAATGAACACTAAGGGCTGGGGCCCT P:\test\LCS>gf temp 000:001 L[ 5] ( 21 81) 000:001 L[ 5] ( 21 81)* 000:001 L[ 5] ( 28 29) 000:001 L[ 5] ( 34 43) 000:001 L[ 5] ( 35 46) 000:001 L[ 5] ( 50 77) 000:001 L[ 5] ( 66 3) 000:001 L[ 5] ( 66 3)* 000:001 L[ 5] ( 93 51) 000:001 L[ 5] ( 93 51)* Completed in 0.009414 Best match: >string 1 - >string 2. 5 characters starting at 21 and 81. Best match: >string 1 - >string 2. 5 characters starting at 21 and 81. Best match: >string 1 - >string 2. 5 characters starting at 28 and 29. Best match: >string 1 - >string 2. 5 characters starting at 34 and 43. Best match: >string 1 - >string 2. 5 characters starting at 35 and 46. Best match: >string 1 - >string 2. 5 characters starting at 50 and 77. Best match: >string 1 - >string 2. 5 characters starting at 66 and 3. Best match: >string 1 - >string 2. 5 characters starting at 66 and 3. Best match: >string 1 - >string 2. 5 characters starting at 93 and 51. Best match: >string 1 - >string 2. 5 characters starting at 93 and 51. P:\test\LCS>484593-5 temp 000:001 L[005] ( 21, 81)'CACTA' ( 28, 29)'ATCTC' ( 34, 43)'AGTGT' ( 35, 46)'GTGTT' ( 50, 77)'TGAAC' ( 66, 3)'CACGG' ( 93, 51)'CTACT' 1 trial of temp ( 147us total), 147us/trial

Update2: This is the shortest test that I've found that reproduces the problem. It may help you to isolate the cause.

P:\test\LCS>type temp >string 1 AAATATCTCGATAATTAGTA >string 2 TGACTAGATCATATAACCAC P:\test\LCS>gf temp 000:001 L[ 4] ( 2 10) 000:001 L[ 4] ( 2 10) 000:001 L[ 4] ( 10 12) Completed in 0.00106 Best match: >string 1 - >string 2. 4 characters starting at 2 and 10. Best match: >string 1 - >string 2. 4 characters starting at 2 and 10. Best match: >string 1 - >string 2. 4 characters starting at 10 and 12. P:\test\LCS>484593-5 temp 000:001 L[004] ( 2, 10)'ATAT' ( 10, 12)'ATAA' 1 trial of temp ( 36us total), 36us/trial

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.