in reply to Fast common substring matching
The latest revisions to your code have fixed most of the problems I identified earlier, but running it over my test suite did turn up this anomoly:
P:\test>GF n2l3172.dat 000:001 L[ 12] ( 364 23) 000:001 L[ 12] ( 364 23)* 000:001 L[ 12] ( 364 23)* 000:001 L[ 12] ( 364 23)* 000:001 L[ 12] (1102 1008) 000:001 L[ 12] (1124 2290) 000:001 L[ 12] (1506 2529) 000:001 L[ 12] (1506 2529)* 000:001 L[ 12] (1506 2529)* 000:001 L[ 12] (1506 2529)* 000:001 L[ 12] (1859 224) 000:001 L[ 12] (2046 2360) 000:001 L[ 12] (2399 1226) 000:001 L[ 12] (2480 401) 000:001 L[ 12] (2494 540) 000:001 L[ 12] (3023 1102) Completed in 5.292793 Best match: >1 - >2. 12 characters starting at 364 and 23. Best match: >1 - >2. 12 characters starting at 364 and 23.* Best match: >1 - >2. 12 characters starting at 364 and 23.* Best match: >1 - >2. 12 characters starting at 364 and 23.* Best match: >1 - >2. 12 characters starting at 1102 and 1008. Best match: >1 - >2. 12 characters starting at 1124 and 2290. Best match: >1 - >2. 12 characters starting at 1506 and 2529. Best match: >1 - >2. 12 characters starting at 1506 and 2529.* Best match: >1 - >2. 12 characters starting at 1506 and 2529.* Best match: >1 - >2. 12 characters starting at 1506 and 2529.* Best match: >1 - >2. 12 characters starting at 1859 and 224. Best match: >1 - >2. 12 characters starting at 2046 and 2360. Best match: >1 - >2. 12 characters starting at 2399 and 1226. Best match: >1 - >2. 12 characters starting at 2480 and 401. Best match: >1 - >2. 12 characters starting at 2494 and 540. Best match: >1 - >2. 12 characters starting at 3023 and 1102.
As you can see, the matches I've marked with * are duplicates. Here's the output from my program:
P:\test>484593-5 n2l3172.dat 000:001 L[012] ( 364, 23)'CAGGAGCGGGCG' (1102,1008)'ACAGCCACGAAA' (1124,2290)'AGCAGGAAGGAG' (1506,2529)'GCGCGAAGCAAA' (1859, 224)'AACAGGAAGGGA' (2046,2360)'CAGAGCCCACAG' (2399,1226)'GGCGCAACCCGG' (2480, 401)'GGGGCCGGCAAC' (2494, 540)'CGCAGCAGAAAC' (3023,1102)'GGCAGCAAGGGC' 1 trial of n2l3172.dat (132.223ms total), 132.223ms/trial
As the dataset lines are 3k in length, I've posted it on my pad rather than here.
Sorry for the length of the lines, but they are randomly generated and I haven't yet worked out what it is about these two lines that induces the anomolous behaviour?
Update: Okay. I managed to find a shorter testcase that demonstrates the duplicate matches problem. Again, I've marked the duplicates with *:
P:\test\LCS>type temp >string 1 AACCAGGGTCATAGGAGCCGGCACTATGATCTCTAGTGTTTTAGGAGTAGTGAACCACCTTCCAGACACG +GCAGCCAACATTCCGTCTATCGCCTACTCG >string 2 TAGCACGGTGATAAGCTACAATTAACGCCATCTCACGCGCGCGAGTGTGTTCTACTACCCATTATATAAG +ATGAAAATGAACACTAAGGGCTGGGGCCCT P:\test\LCS>gf temp 000:001 L[ 5] ( 21 81) 000:001 L[ 5] ( 21 81)* 000:001 L[ 5] ( 28 29) 000:001 L[ 5] ( 34 43) 000:001 L[ 5] ( 35 46) 000:001 L[ 5] ( 50 77) 000:001 L[ 5] ( 66 3) 000:001 L[ 5] ( 66 3)* 000:001 L[ 5] ( 93 51) 000:001 L[ 5] ( 93 51)* Completed in 0.009414 Best match: >string 1 - >string 2. 5 characters starting at 21 and 81. Best match: >string 1 - >string 2. 5 characters starting at 21 and 81. Best match: >string 1 - >string 2. 5 characters starting at 28 and 29. Best match: >string 1 - >string 2. 5 characters starting at 34 and 43. Best match: >string 1 - >string 2. 5 characters starting at 35 and 46. Best match: >string 1 - >string 2. 5 characters starting at 50 and 77. Best match: >string 1 - >string 2. 5 characters starting at 66 and 3. Best match: >string 1 - >string 2. 5 characters starting at 66 and 3. Best match: >string 1 - >string 2. 5 characters starting at 93 and 51. Best match: >string 1 - >string 2. 5 characters starting at 93 and 51. P:\test\LCS>484593-5 temp 000:001 L[005] ( 21, 81)'CACTA' ( 28, 29)'ATCTC' ( 34, 43)'AGTGT' ( 35, 46)'GTGTT' ( 50, 77)'TGAAC' ( 66, 3)'CACGG' ( 93, 51)'CTACT' 1 trial of temp ( 147us total), 147us/trial
Update2: This is the shortest test that I've found that reproduces the problem. It may help you to isolate the cause.
P:\test\LCS>type temp >string 1 AAATATCTCGATAATTAGTA >string 2 TGACTAGATCATATAACCAC P:\test\LCS>gf temp 000:001 L[ 4] ( 2 10) 000:001 L[ 4] ( 2 10) 000:001 L[ 4] ( 10 12) Completed in 0.00106 Best match: >string 1 - >string 2. 4 characters starting at 2 and 10. Best match: >string 1 - >string 2. 4 characters starting at 2 and 10. Best match: >string 1 - >string 2. 4 characters starting at 10 and 12. P:\test\LCS>484593-5 temp 000:001 L[004] ( 2, 10)'ATAT' ( 10, 12)'ATAA' 1 trial of temp ( 36us total), 36us/trial