monkfan has asked for the wisdom of the Perl Monks concerning the following question:
Hi,
I have two strings that I want to compute the
number of mismatches between them.
These two strings are of the "same" size. Let's call
them 'source' string and 'target' string.
Now, the problem is that the 'source' string may
come in ambiguous form, meaning that in one position
they may contain more than 1 (upto 4) characters.
The ambiguous position is marked with square bracketed [ATCG] region.
Is there an efficient way I can compute the number of
mismatch between "source" and "target", of the above
situation?
The example is as follows:
Example 1 (where the source is ambiguous):
my $source1 = '[TCG]GGGG[AT]'; # ambiguous
my $target1 = 'AGGGGC'; # No of mismatch = 2 on position 1 and 6
my $target2 = 'TGGGGC'; # No of mismatch = 1 on position 6 only
I have no problem when dealing when 'source' string is not
ambigous like this
Example 2 (where the source is NOT ambiguous):
my $source2 = 'TGGGGT'; # not-ambiguous
my $target1 = 'AGGGGC'; # No of mismatch = 2 on position 1 and 6
my $target3 = 'TGGGGT'; # No of mismatch = 0 all position matches
For example I can use bitwise operator to do it.
But for ambigous case I'm lost.
PS: "target" string can be assumed to be never ambiguous.
Re: Mismatch Positions of Ambiguous String
by GrandFather (Saint) on Jan 25, 2006 at 11:10 UTC
|
use strict;
use warnings;
my @targets = qw(AGGGGC TGGGGC);
my @sources = (<DATA>);
chomp @sources;
for my $source (@sources) {
my @fields = grep length, split /(\[[^[\]]+\]|[^[\]]+)/, $source;
my @cells = map {(index $_, '[') == 0 ? $_ : split ''} @fields;
my $matchStr = join '', map {/\[/ ? "\x80" : $_} @fields;
for my $target (@targets) {
my @segments = grep length, split /(\x00+|.)/, $target ^ $matchStr
+;
my $index = 0;
for my $field (@segments) {
next if length ($field) > 1 or $field eq "\x00";
# Only here for an ambiguous char
my $str = $cells[$index];
my $chr = substr $target, $index, 1;
next if $chr =~ /$str/;
print "Mismatch at index $index in $source against $target\n";
}
continue {
$index += length $field;
}
}
}
__DATA__
[TCG]GGGG[AT]
TGGGGT
[AC][TCG]GGG[AT]
Prints:
Mismatch at index 0 in [TCG]GGGG[AT] against AGGGGC
Mismatch at index 5 in [TCG]GGGG[AT] against AGGGGC
Mismatch at index 5 in [TCG]GGGG[AT] against TGGGGC
Mismatch at index 0 in TGGGGT against AGGGGC
Mismatch at index 5 in TGGGGT against AGGGGC
Mismatch at index 5 in TGGGGT against TGGGGC
Mismatch at index 5 in [AC][TCG]GGG[AT] against AGGGGC
Mismatch at index 0 in [AC][TCG]GGG[AT] against TGGGGC
Mismatch at index 5 in [AC][TCG]GGG[AT] against TGGGGC
DWIM is Perl's answer to Gödel
| [reply] [d/l] [select] |
Re: Mismatch Positions of Ambiguous String
by pKai (Priest) on Jan 25, 2006 at 13:09 UTC
|
Since characters in square brackets happen to be character classes in regexes, why not use them as such!?
use strict;
use warnings;
use Algorithm::Loops qw(MapCarE);
MapCarE {
my ($s, $t) = @_;
my $x= $t =~ /$s/ ? '~' : '!';
print "$t $x~ $s\n";
}
[shift =~ /\[\w+\]|\w/g], # split source into chars or char classes
[split //, shift]; # split target into chars
D:\temp>mm [TCG]GGGG[AT] AGGGGC
A !~ [TCG]
G ~~ G
G ~~ G
G ~~ G
G ~~ G
C !~ [AT]
D:\temp>mm [TCG]GGGG[AT] TGGGGC
T ~~ [TCG]
G ~~ G
G ~~ G
G ~~ G
G ~~ G
C !~ [AT]
Using Algorithm::Loops just for the convenience to iterate over two lists at once without distracting from the main point. | [reply] [d/l] [select] |
Re: Mismatch Positions of Ambiguous String
by hv (Prior) on Jan 25, 2006 at 10:15 UTC
|
I was going to recommend String::Approx, but it turns out that it doesn't yet support regexp patterns. Here's a simple approach to doing it yourself:
sub mismatches {
my($source, $target) = @_;
my @sparts = ($source =~ /(\[.*?\]|.)/g);
scalar grep substr($target, $_, 1) !~ /^$sparts[$_]/, 0 .. $#spart
+s;
}
This assumes source and target are the same length; if you're going to be searching for the closest match within a longer target you'll need a bit more work, and if you need decent performance on very large strings you'll almost certainly want to code it in C/XS.
Hugo | [reply] [d/l] |
|
Dear hv,
Sorry for having have to come back to you again.
How can I extend your code above, such that
it can handle the target which is also ambiguous like this:
my $source1 = '[TCG]GGGG[AT]'; # ambiguous
my $target1 = 'AGGGG[CT]'; # No of mismatch = 1 only at position 1
Also the size of source and target will be the same.
I tried this:
sub mismatches {
my($source, $target) = @_;
my @sparts = ($source =~ /(\[.*?\]|.)/g);
my @tparts = ($target =~ /(\[.*?\]|.)/g);
scalar grep $tparts[$_] !~ /^$sparts[$_]/, 0 .. $#sparts;
}
But doesn't work. Where did I go wrong?
| [reply] [d/l] [select] |
|
But doesn't work. Where did I go wrong?
It isn't as simple as that: my original approach relied on treating each ambiguous [ACGT] in the source string as a regexp, but to do that it needs to know a) which is the fixed string and which the regexp, and b) that they are not both regexps.
The simplest extension is as below, but this suffers further on speed - it will probably be too slow if you're dealing with strings a few thousand base pairs long:
sub mismatches {
my($source, $target) = @_;
my @sparts = ($source =~ /(\[.*?\]|.)/g);
my @tparts = ($target =~ /(\[.*?\]|.)/g);
scalar grep {
my($s, $t) = ($sparts[$_], $tparts[$_]);
$s !~ /\[/ ? ($s !~ /$t/)
: $t !~ /\[/ ? ($t !~ /$s/)
: !intersect($s, $t)
} 0 .. $#sparts;
}
sub intersect {
my($s, $t) = @_;
my %seen = map +($_ => 1), $s =~ /[^\[\]]/g;
scalar grep $seen{$_}, $t =~ /[^\[\]]/g;
}
This says: if source is not ambiguous, treat the corresponding fragment of the target as a regexp; else if the target is not ambiguous, treat the source fragment as a regexp; else check a full intersection of the two.
If your strings only include ACGT, a more efficient approach would be to transform each string into a bit vector that sets a bit for each base that may be present: my %bits = ('A' => 1, 'C' => 2, 'G' => 4, 'T' => 8);
my $source1 = bitwise('[TCG]GGGG[AT]');
my $target1 = bitwise('AGGGG[CT]');
print mismatches($source, $target1), "\n";
sub mismatches {
my($source, $target) = @_;
($source & $target) =~ tr/\0//;
}
sub bitwise {
my $string = shift;
join '', map {
my $char = 0;
$char |= $bits{$_} for /[ACGT]/g;
chr($char)
} $string =~ /(\[.*?\]|.)/g;
}
Once the strings are transformed into this bitwise representation, checking for mismatches is very fast even with long strings.
Hugo | [reply] [d/l] [select] |
Re: Mismatch Positions of Ambiguous String
by prasadbabu (Prior) on Jan 25, 2006 at 10:17 UTC
|
use strict;
use warnings;
my $source= '[TCG]GGGG[AT]'; # ambiguous
my $target1 = 'AGGGGC'; # No of mismatch = 2 on position 1 and 6
my $target2 = 'TGGGGC'; # No of mismatch = 1 on position 6 only
my @tar = split //, $target1;
my @t = @tar;
for my $i (1..($#t+1))
{
my $a ='';
if ($source !~ /^\[/) #not ambiguous
{
($a) = $source =~ /^(.)/s;
$source =~ s/^.//s;
my $s = shift @tar;
($s eq $a) ? print "$s matches position $i in target\n" : prin
+t "$s not matches position $i in target\n";
}
else #ambiguous
{
($a) = $source =~ /^\[([^\]]*)/s;
my @a = split //, $a;
$source =~ s/^([^\]]*)\]//s;
my $s = shift @tar;
(my @m = grep /$s/, @a) ? print "[@a] matches position $i in t
+arget\n" : print "[@a] not matches position $i in target\n";
}
}
| [reply] [d/l] |
Re: Mismatch Positions of Ambiguous String
by tweetiepooh (Hermit) on Jan 25, 2006 at 11:14 UTC
|
Maybe you need to think of the problem in reverse ie the target is the source and the source the target and see where that works.
target = [ACT]GCTATC[GC]
source = TGCTATCA
Then you can apply each character or group in the target as a regex to the source. The answer should be the same if as in examples they are the same length.
If not then you may need to keep some form of pointer to the actual place in the sequence (this looks like gene/DNA sequences).
| [reply] [d/l] |
Re: Mismatch Positions of Ambiguous String
by Samy_rio (Vicar) on Jan 25, 2006 at 10:52 UTC
|
Hi monkfan, If i understood your question correctly, it helps you.
use strict;
my %compare = ('[TCG]GGGG[AT]'=>['AGGGGC', 'TGGGGC'], 'TGGGGT'=>['AGGG
+GC','AGGGGC']);
for my $ky (keys (%compare))
{
my @source = split /GGGG/, $ky;
my @arr = @{$compare{$ky}};
print "\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n";
print "\nSource:\t", $ky;
print "\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n";
for my $t (0..$#arr){
my @tar = split /GGGG/,$arr[$t];
print "\nTarget:\t", $arr[$t], "\n";
my $mis = 0;
for my $po (0..$#source){
if ($tar[$po] !~ m/$source[$po]/){
$mis++;
print "\nPosition for Mismatch : ",$po + 1 if ($po ==
+0);
print "\nPosition for Mismatch : ",$po + 5 if ($po ==
+1);
}
}
print "\nNo of Mismatch = $mis\n";
}
print "\n+++++++++++++++++++++++++++++\n";
}
__END__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Source: [TCG]GGGG[AT]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Target: AGGGGC
Position for Mismatch : 1
Position for Mismatch : 6
No of Mismatch = 2
Target: TGGGGC
Position for Mismatch : 6
No of Mismatch = 1
+++++++++++++++++++++++++++++
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Source: TGGGGT
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Target: AGGGGC
Position for Mismatch : 1
Position for Mismatch : 6
No of Mismatch = 2
Target: AGGGGC
Position for Mismatch : 1
Position for Mismatch : 6
No of Mismatch = 2
+++++++++++++++++++++++++++++
Regards, Velusamy R. eval"print uc\"\\c$_\""for split'','j)@,/6%@0%2,`e@3!-9v2)/@|6%,53!-9@2~j';
| [reply] [d/l] [select] |
Re: Mismatch Positions of Ambiguous String
by monkey_boy (Priest) on Jan 25, 2006 at 09:56 UTC
|
| [reply] [d/l] |
Re: Mismatch Positions of Ambiguous String
by murugu (Curate) on Jan 25, 2006 at 15:45 UTC
|
use strict;
use warnings;
my $str = 'TTCTTCT';
my $tar = 'ATGTTCT';
my $i =0;
while (length($str)) {
if (substr($str,0,1) eq '[') {
my $index = index $str,']';
my $m = substr($str,0,$index+1);
print "Mismatch at index $i\n" unless (substr($tar,$i,1)=~/$m/)
+;
substr($str,0,$index+1)="";
}
else {
print "Mismatch at index $i\n" unless (substr($str,0,1) eq subs
+tr($tar,$i,1));
substr($str,0,1)="";
}
$i++;
}
Regards, Murugesan Kandasamy use perl for(;;);
| [reply] [d/l] |
Re: Mismatch Positions of Ambiguous String
by Roy Johnson (Monsignor) on Jan 25, 2006 at 19:14 UTC
|
use strict;
use warnings;
my $source1 = '[TCG]GGGG[AT]'; # ambiguous
my $target1 = 'AGGGGC'; # No of mismatch = 2 on position 1 and 6
my $target2 = 'TGGGGC'; # No of mismatch = 1 on position 6 only
# Turn the source into a regexp that accepts any character at each
# position, but only captures the matches
(my $sourcepat = $source1) =~ s/(\[.*?\]|.)/(?:($1)|.)/g;
for ($target1, $target2) {
# Substitute _ for undef in the matches returned
my @result = map defined()?$_:'_', /$sourcepat/;
print "$_: @result\n";
}
__DATA__
Output is:
AGGGGC: _ G G G G _
TGGGGC: T G G G G _
Caution: Contents may have been coded under pressure.
| [reply] [d/l] |
|
|