Re^2: count backrefenence regex

I fully agree with your constructive comments. Thanks. This was all supposed to be a quick-and-dirty timing test to compare this attempted solution against my work-around. Dirty it was, but obviously not so quick.

My apologies for not fully fleshing out the "problem" that the code is trying to address. Like many, I assume everyone else knows what I am thinking and just jump right in where my head is at the moment.

In your scenario I would say you have 4 x "AAA_x_" but 2 x "BBB" (not "BBB_x_").

My thinking is: given a string, e.g., "GATCGGGGACTTAGGATCCGATCT" where (if I typed it right) the string has 2 x "GATC" and 2 "GATCT", find the number of occurrences of each unique substring of length >= some minimum length (I used 3 in my code) that occur more than once. "GATC" occurs 4 times, but twice with the extra "T", so I would call those 2 different substrings.

BTW - I was using a 1,000,000 character string to make it take long enough to see a time difference.

Also, I misspoke above: what I would ultimately report is each substring and its locations (I figure for that I would use $` from the regex matching). If the locations are pushed into an array and I decide I want the count, I would just use the length of the array.
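For illustration only, here is a minimal brute-force sketch of the bookkeeping described above: every substring of at least the minimum length is mapped to an array of its start offsets, and the count is simply the size of that array. The helper name `repeated_substrings` is hypothetical, not code from this thread. The sketch also does not try to fold a shorter repeat into a longer one that contains it, so it is not necessarily the exact definition intended above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): map each substring of
# length >= $min_len that occurs more than once to its start offsets.
sub repeated_substrings {
    my ($str, $min_len) = @_;
    my %offsets;
    my $len = length $str;
    for my $start (0 .. $len - $min_len) {
        for my $size ($min_len .. $len - $start) {
            # Record where this substring begins (0-based).
            push @{ $offsets{ substr $str, $start, $size } }, $start;
        }
    }
    # Drop substrings that were seen only once.
    delete @offsets{ grep { @{ $offsets{$_} } < 2 } keys %offsets };
    return \%offsets;
}

my $found = repeated_substrings('GATCGGGGACTTAGGATCCGATCT', 3);
for my $sub (sort keys %$found) {
    my @at = @{ $found->{$sub} };
    # The count is just the number of recorded offsets.
    printf "%-6s count=%d offsets=%s\n", $sub, scalar @at, "@at";
}
```

This enumerates every substring, so memory and time grow roughly with the square of the string length. It is fine for a 24-character example but not something to run on a million characters.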

Re^3: count backrefenence regex by AnomalousMonk (Archbishop) on Oct 11, 2021 at 03:11 UTC
> ... I was using a 1,000,000 character string ...

Please see haukex's comment on this here, in the paragraph beginning "You've got a few other issues in your code". In fact, the strings you were producing with the OPed code are only about 12,000 characters long.

Give a man a fish: `<%-{-{-{-<`
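As a hedged illustration of the point about string length (this is not the OPed code, and `random_dna` is a hypothetical helper name), one way to build a random test string and confirm with `length` that it really has the intended size:

```perl
use strict;
use warnings;

# Hypothetical helper (not the OPed code): build a random string of
# exactly $n characters drawn from A, C, G and T.
sub random_dna {
    my ($n) = @_;
    my @bases = qw(A C G T);
    # rand @bases yields a number in [0, 4); the array index truncates it.
    return join '', map { $bases[ rand @bases ] } 1 .. $n;
}

my $dna = random_dna(1_000_000);
# Check the size before timing anything against the string.
printf "length: %d\n", length $dna;
```

Printing the length before benchmarking is a cheap way to catch a generator that produces far fewer characters than expected.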
Re^3: count backrefenence regex by LanX (Saint) on Oct 11, 2021 at 10:03 UTC
> given a string, e.g., "GATCGGGGACTTAGGATCCGATCT" where (if I typed it right) the string has 2 x "GATC" and 2 "GATCT",

you didn't

```
  DB<269> x "GATCGGGGACTTAGGATCCGATCT" =~ /(GATC)/g
0  'GATC'
1  'GATC'
2  'GATC'
  DB<270> x "GATCGGGGACTTAGGATCCGATCT" =~ /(GATCT)/g
0  'GATCT'
  DB<271>
```

> each unique substring of length >= some minimum length (I used 3 in my code) that occur more than once.

That's not solvable with a trivial regex because of the overlaps°. I suppose tybalt's complex solution with forced backtracking and embedded code for temporary results already nailed it.

But I'm pretty sure we had this question here in the past. Maybe try Super Search.

Also, identifying repeated sequences seems to be a standard task in BioInf, so some libraries should offer this.

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery

°) `(AAA_[BBB_)CCC]_(AAA_BBB_)[BBB_CCC]` brackets ( and [ mark different repeated but overlapping sequences.
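A small sketch of the narrower, known-substring case, to show where the locations can come from: a zero-width lookahead lets `//g` report overlapping occurrences, and `$-[0]` gives the start offset directly, without going through the prematch variable. The helper name `match_offsets` is hypothetical. This does not address the general problem of discovering the repeated substrings in the first place.

```perl
use strict;
use warnings;

# Hypothetical helper: start offsets of every occurrence of $needle
# in $str, overlapping occurrences included.
sub match_offsets {
    my ($str, $needle) = @_;
    my @offsets;
    # The lookahead consumes no characters, so the engine moves on one
    # position at a time and overlapping occurrences are not skipped.
    while ($str =~ /(?=\Q$needle\E)/g) {
        push @offsets, $-[0];    # 0-based start of the current match
    }
    return @offsets;
}

my $dna = 'GATCGGGGACTTAGGATCCGATCT';
print "GATC:  @{[ match_offsets($dna, 'GATC') ]}\n";
print "GATCT: @{[ match_offsets($dna, 'GATCT') ]}\n";
print "GG:    @{[ match_offsets($dna, 'GG') ]}\n";
```

For this string, "GATC" is found at offsets 0, 14 and 19 and "GATCT" only at 19, which agrees with the debugger session above. "GG" is found at 4, 5, 6 and 13, whereas a plain `/GG/g` would miss the overlapping occurrence at 5.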

