PerlMonks  

Re^2: count backrefenence regex

by tmolosh (Initiate)
on Oct 11, 2021 at 02:33 UTC [id://11137411]


in reply to Re: count backrefenence regex
in thread count backrefenence regex

I fully agree with your constructive comments. thanks. this was all supposed to be a quick-and-dirty time test to compare this attempted solution vs my work-around. obviously it was dirty, but not so quick.

my apologies for not fully fleshing out the "problem" that the code is trying to address. like many, I assume everyone else knows what I am thinking and just jump right in where my head is at the moment.

in your scenario I would say you have 4 x "AAA_x_" but 2 "BBB" (not "BBB_x_")

my thinking is: given a string, e.g., "GATCGGGGACTTAGGATCCGATCT" where (if I typed it right) the string has 2 x "GATC" and 2 "GATCT", find the number of occurrences of each unique substring of length >= some minimum length (I used 3 in my code) that occur more than once. "GATC" occurs 4 times, but twice with the extra "T" so I would call those 2 different substrings.

BTW - I was using a 1,000,000 character string to make it take long enough to see a time difference.

also, I misspoke above: what I would ultimately report is each substring and its locations (I figure for that I would use $` from the regex matching). If locations are pushed into an array and I decide I want the count, I would just use the number of elements in the array.
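To make the task described above concrete, here is a minimal brute-force sketch, not the code from the original post. It assumes overlapping occurrences count, that an upper length bound is acceptable, and that every substring is reported independently of longer ones containing it. The helper name `repeated_substrings` and the `$max` parameter are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: map each substring of length $min..$max to the
# list of 0-based offsets where it starts. Overlapping occurrences count.
sub repeated_substrings {
    my ($str, $min, $max) = @_;
    my %pos;
    for my $len ($min .. $max) {
        for my $i (0 .. length($str) - $len) {
            push @{ $pos{ substr $str, $i, $len } }, $i;
        }
    }
    # Keep only substrings that were seen more than once.
    delete @pos{ grep { @{ $pos{$_} } < 2 } keys %pos };
    return \%pos;
}

my $rep = repeated_substrings("GATCGGGGACTTAGGATCCGATCT", 3, 5);
for my $s (sort keys %$rep) {
    my @at = @{ $rep->{$s} };
    printf "%-5s count=%d at=%s\n", $s, scalar @at, join ",", @at;
}
```

The count falls out of the location list as `scalar @at`. This stores one offset per position per length, so on a very long string the length range needs to stay small. If the offsets come from a regex match instead, `$-[0]` gives the start of the last match directly, without measuring the length of `` $` ``.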

Replies are listed 'Best First'.
Re^3: count backrefenence regex
by AnomalousMonk (Archbishop) on Oct 11, 2021 at 03:11 UTC
    ... I was using a 1,000,000 character string ...

    Please see haukex's comment on this here in the paragraph beginning "You've got a few other issues in your code". In fact, the strings you were producing with the OPed code are only about 12,000 characters long.


    Give a man a fish:  <%-{-{-{-<

Re^3: count backrefenence regex
by LanX (Saint) on Oct 11, 2021 at 10:03 UTC
    > given a string, e.g., "GATCGGGGACTTAGGATCCGATCT" where (if I typed it right) the string has 2 x "GATC" and 2 "GATCT",

    you didn't

    DB<269> x "GATCGGGGACTTAGGATCCGATCT" =~ /(GATC)/g
    0  'GATC'
    1  'GATC'
    2  'GATC'
    DB<270> x "GATCGGGGACTTAGGATCCGATCT" =~ /(GATCT)/g
    0  'GATCT'
    DB<271>
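The same check can be run outside the debugger. This is a small sketch using the "count-of" idiom (assigning the match list to an empty list in scalar context); the variable names are only illustrative.

```perl
use strict;
use warnings;

my $dna = "GATCGGGGACTTAGGATCCGATCT";

# Assigning the list of matches to an empty list, then to a scalar,
# yields the number of matches.
my $gatc  = () = $dna =~ /GATC/g;
my $gatct = () = $dna =~ /GATCT/g;

print "GATC: $gatc, GATCT: $gatct\n";   # GATC: 3, GATCT: 1
```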

    > each unique substring of length >= some minimum length (I used 3 in my code) that occur more than once.

    That's not solvable with a trivial regex because of the overlaps°. I suppose tybalt's complex solution with forced backtracking and embedded code for temporary results already nailed it.
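A small illustration of why overlaps trip up a plain `/g` match. This is not the forced-backtracking solution mentioned above, only a simpler zero-width lookahead sketch for a pattern that is already known, so it does not discover unknown repeats on its own.

```perl
use strict;
use warnings;

my $str = "AAAA";

# A plain /g match resumes after the previous match, so overlaps are skipped.
my $plain = () = $str =~ /AAA/g;

# A zero-width lookahead consumes nothing, so every start position is tried.
my @starts;
push @starts, $-[0] while $str =~ /(?=AAA)/g;

print "plain: $plain, overlapping: ", scalar(@starts), " at @starts\n";
```

Here the plain match finds one "AAA" while the lookahead finds two, starting at offsets 0 and 1.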

    But I'm pretty sure we had this question here in the past. Maybe try Super Search.

    Also, identifying repeated sequences seems to be a standard task in bioinformatics, so some libraries should offer this.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

    °) (AAA_[BBB_)CCC]_(AAA_BBB_)[BBB_CCC]: the brackets ( and [ mark different repeated but overlapping sequences.
