spq has asked for the wisdom of the Perl Monks concerning the following question:
I have a database table containing regular expressions used against strings (sequences of characters to be used in DNA oligo synthesis) as part of a QC process. So far, this has worked great. But now I have a condition that I'm having trouble writing the regex for.
The string is required to contain only alpha characters. Case is not important, and mixed case is allowed, so the i modifier is used for all match expressions.
So here's the outline of the new condition. Any number of A, T, C and G are always allowed. Some orders, however, may contain symbols representing degenerate positions. For example, R may be used to represent a position that can be either A or G. The total list of possible alternate symbols is R, Y, M, K, S, W, H, B, V, D, and N. Although any of them may be used, only a total of two different alternate codes can be used in a given string (mechanical limitation of the synthesis machine).
So, the challenge is to have a single regular expression that will match a sequence containing any number of A, T, C, G and any number of no more than two different characters from the above alternate codes.
Thanks in advance for whatever wisdom and guidance you can bestow!
Re: self limiting regex help
by danger (Priest) on May 22, 2002 at 15:53 UTC
If I understand your conditions correctly, this should do:
print if /^ [ACGT]* ([RYMKSWHBVDN])?
(?:[ACGT]|\1)*([RYMKSWHBVDN])?
(?:[ACGT]|\1|\2)*$
/ix;
Eureka! I was just attempting to do something similar, but had discovered that a backreference such as \1 doesn't work inside a character class.
Thank you (and everyone who responded) very much for your help!
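For anyone hitting the same wall, here is a minimal sketch of the difference. The test string 'RAR' and the variable names are made up for illustration. Inside a bracketed character class Perl does not treat \1 as a backreference (it is read as an octal character escape), whereas in an alternation it refers to the captured text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical test string: the code R, an A, then R again.
my $seq = 'RAR';

# Inside [...] the \1 is not a backreference, so the class only
# contains A (plus a control character) and the second R is rejected.
my $in_class = $seq =~ /^(R)[A\1]*$/i ? 1 : 0;

# As an alternative inside a non-capturing group, \1 really does
# mean "whatever group 1 captured", so the second R is accepted.
my $in_group = $seq =~ /^(R)(?:A|\1)*$/i ? 1 : 0;

print "character class: ", ($in_class ? "matches" : "does not match"), "\n";
print "alternation:     ", ($in_group ? "matches" : "does not match"), "\n";
```

This is why the solution above spells the repeated part as (?:[ACGT]|\1)* rather than trying to put \1 inside the brackets.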
Re: self limiting regex help
by Molt (Chaplain) on May 22, 2002 at 16:03 UTC
Okay, having read a fair bit of 'Mastering Regular Expressions' today I'm going to attack this one.
Not going to fully comment the regexp, but essentially it does the following: it matches any number of ATCGs, then a single exception character which it stores in \1, then any number of ATCGs or \1s, then another single different exception character which it stores in \2, then any number of ATCGs, \1s, or \2s. This pattern is anchored to each end of the string too.
Sorry the regexp isn't nicely laid out, but it should work.
If this doesn't do quite what you want let me have some more test data and I'll fix it. Nice puzzle!
#!/usr/bin/perl -w
use strict;

# Sample sequences: no alternate codes, two codes (R, Y), three codes (R, Y, N)
my @tests = (
    'ATCG',
    'ATCGGTATATATRGTCGAYGCRGTCAGA',
    'ATCGGTATATATRGTCGAYGCNGTCAGA',
);

foreach (@tests) {
    if (/^[ATCG]*([RYMKSWHBVDN])?(?:[ATCG]|\1)*([RYMKSWHBVDN])?(?:[ATCG]|\1|\2)*$/i) {
        print "$_ matches\n";
    } else {
        print "$_ does not match\n";
    }
}
Update: Slight change, original version demanded two exceptional codes. Oops.
Update 2: Yes, this is the same as Danger's code above. Ah well, two people coming up with the same solution at least inspires confidence.
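Since the one-line form is hard to read, here is a sketch of the same pattern laid out with the x modifier and one comment per step. The variable name $qc is only for illustration, and storing the pattern with qr// is an assumption about how it might be used, not something stated in the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same pattern as above, spread out with /x so each step can carry a comment.
my $qc = qr/
    ^ [ACGT]*                  # leading plain bases
    ([RYMKSWHBVDN])?           # first alternate code, if any (group 1)
    (?: [ACGT] | \1 )*         # plain bases or repeats of the first code
    ([RYMKSWHBVDN])?           # second alternate code, if any (group 2)
    (?: [ACGT] | \1 | \2 )*    # plain bases or repeats of either code
    $                          # nothing else may follow
/ix;

# The sequences from the test script above.
for my $seq ('ATCG',
             'ATCGGTATATATRGTCGAYGCRGTCAGA',
             'ATCGGTATATATRGTCGAYGCNGTCAGA') {
    printf "%-30s %s\n", $seq, ($seq =~ $qc ? 'matches' : 'does not match');
}
```

Both capture groups are optional, so a sequence with zero or one alternate code still passes. A third distinct code cannot be consumed by any part of the pattern, so the anchored match fails.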
Re: self limiting regex help
by ferrency (Deacon) on May 22, 2002 at 15:23 UTC
If your string is allowed to be empty, try this:
print "match" if $string =~ /[atcg]*([RYMKSWHBVDN][atcg]*){0,2}/i;
If not, try this:
print "match" if $string and $string =~ /[atcg]*([RYMKSWHBVDN][atcg]*){0,2}/i;
(Warning, neither was tested)
Update: Sorry about that. You're both right, I misunderstood the initial request. The regex above matches up to 2 occurrences of any of the alternate codes, not any number of occurrences of up to 2 of the alternate codes. (It is also unanchored, so as written it would match any string at all.)
You could do what you want with an extremely long regex which enumerates every combination of 2 alternate codes. That's a really bad answer, though: it would be much shorter and more straightforward to do it with regexes supplemented by other perl code.
Sorry for the wrong answer :)
Alan
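As a rough sketch of the "regexes supplemented by other perl code" idea: check the allowed symbols with one regex, then count the distinct alternate codes in a hash. The sub name passes_qc and its $max parameter are hypothetical, chosen only for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $seq contains only valid symbols and
# at most $max different alternate codes (default 2), otherwise 0.
sub passes_qc {
    my ($seq, $max) = @_;
    $max = 2 unless defined $max;

    # Only the four bases and the alternate codes are allowed at all.
    return 0 unless $seq =~ /^[ACGTRYMKSWHBVDN]*$/i;

    # Record each distinct alternate code once, ignoring case.
    my %seen;
    $seen{ uc $1 }++ while $seq =~ /([RYMKSWHBVDN])/gi;

    return keys(%seen) <= $max ? 1 : 0;
}

for my $seq ('ATCGGTATATATRGTCGAYGCRGTCAGA',
             'ATCGGTATATATRGTCGAYGCNGTCAGA') {
    print "$seq ", (passes_qc($seq) ? "passes" : "fails"), "\n";
}
```

This version accepts an empty string. If that should be rejected, an extra length check would be needed.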
Hmm, that looks like it would match fine. But I don't see how it would limit a string to containing any number of occurrences of only two of the alternate codes?
In case I wasn't clear in my first posting, the regex should match on a string that is within the QC criteria, but fail if not. So:
ATCGGTATATATRGTCGAYGCRGTCAGA
Would be matched, but:
ATCGGTATATATRGTCGAYGCNGTCAGA
Wouldn't, because the N near the end introduces a third ambiguity code.
Re: self limiting regex help
by vladb (Vicar) on May 22, 2002 at 15:26 UTC
To limit the number of 'special' characters you may have in a matching text, you could use this:
/[RYMKSWHBVDN]{0,2}/i
However, I'm not quite sure how to integrate this piece so that it also satisfies this requirement:
any number of A,T,C,G ...
Could you try something along these lines:
/[ATCG][RYMKSWHBVDN]{0,2}/i
Oww, but then, they could be mixed, right?
UPDATE: Oh well, I believe the solution offered by ferrency is somewhat closer to what you need (Note: I didn't notice his solution until I actually submitted my alternative ;/)
_____________________
$"=q;grep;;$,=q"grep";for(`find . -name ".saves*~"`){s;$/;;;/(.*-(\d+)-.*)$/;$_=["ps -e -o pid | "," $2 | "," -v "," "]`@$_`?{print"+ $1"}:{print"- $1"}&&`rm $1`;print"\n";}
Thanks.
Although they can be mixed, I suppose stating that there can be any number of ATCGs may mislead. Now that I read your post, I think I may have been getting hung up on that myself. Other QC regexes applied ensure that the string contains only the allowable characters as a class. The current methodology I've applied is to get all relevant QC expressions from the database and try each expression in turn against the string. Currently there is no ordering, and I don't think that should matter, if all pass. But I could add it.
So the real factor is only whether or not there are multiple occurrences of more than 2 of the allowable class of alternate codes.
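If the allowed-character check really is handled by other expressions, one possible way to test only the "no more than two different alternate codes" rule in a single regex is a negative lookahead that tries to find three different codes. This is a sketch, not something proposed in the thread: the name $max_two_codes and the use of the x and s modifiers are assumptions made for readability.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Matches when the string does NOT contain three different alternate codes.
# It does not check that the other characters are valid bases.
my $max_two_codes = qr/
    ^
    (?!                                   # give up if all of this can be found:
        .* ([RYMKSWHBVDN])                #   some alternate code (group 1)
        .* (?! \1 )      ([RYMKSWHBVDN])  #   later, a different code (group 2)
        .* (?! \1 | \2 ) [RYMKSWHBVDN]    #   later still, a third different code
    )
/ixs;

for my $seq ('ATCGGTATATATRGTCGAYGCRGTCAGA',
             'ATCGGTATATATRGTCGAYGCNGTCAGA') {
    print "$seq ", ($seq =~ $max_two_codes ? "passes" : "fails"), "\n";
}
```

Because it only looks at the alternate codes, the order in which this expression is tried relative to the other QC expressions should not matter.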