Mismatch Positions of Ambiguous String

monkfan has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Mismatch Positions of Ambiguous String by GrandFather (Saint) on Jan 25, 2006 at 11:10 UTC
I like bitwise operators :) use strict; use warnings; my @targets = qw(AGGGGC TGGGGC); my @sources = (<DATA>); chomp @sources; for my $source (@sources) { my @fields = grep length, split /(\[[^[\]]+\]\|[^[\]]+)/, $source; my @cells = map {(index $_, '[') == 0 ? $_ : split ''} @fields; my $matchStr = join '', map {/\[/ ? "\x80" : $_} @fields; for my $target (@targets) { my @segments = grep length, split /(\x00+\|.)/, $target ^ $matchStr +; my $index = 0; for my $field (@segments) { next if length ($field) > 1 or $field eq "\x00"; # Only here for an ambiguous char my $str = $cells[$index]; my $chr = substr $target, $index, 1; next if $chr =~ /$str/; print "Mismatch at index $index in $source against $target\n"; } continue { $index += length $field; } } } __DATA__ [TCG]GGGG[AT] TGGGGT [AC][TCG]GGG[AT] [download] Prints: `Mismatch at index 0 in [TCG]GGGG[AT] against AGGGGC Mismatch at index 5 in [TCG]GGGG[AT] against AGGGGC Mismatch at index 5 in [TCG]GGGG[AT] against TGGGGC Mismatch at index 0 in TGGGGT against AGGGGC Mismatch at index 5 in TGGGGT against AGGGGC Mismatch at index 5 in TGGGGT against TGGGGC Mismatch at index 5 in [AC][TCG]GGG[AT] against AGGGGC Mismatch at index 0 in [AC][TCG]GGG[AT] against TGGGGC Mismatch at index 5 in [AC][TCG]GGG[AT] against TGGGGC` [download] DWIM is Perl's answer to Gödel	[reply] [d/l] [select]
Re: Mismatch Positions of Ambiguous String by pKai (Priest) on Jan 25, 2006 at 13:09 UTC
Since characters in square brackets happen to be character classes in regexes, why not use them as such!? `use strict; use warnings; use Algorithm::Loops qw(MapCarE); MapCarE { my ($s, $t) = @_; my $x= $t =~ /$s/ ? '~' : '!'; print "$t $x~ $s\n"; } [shift =~ /\[\w+\]\|\w/g], # split source into chars or char classes [split //, shift]; # split target into chars` [download] `D:\temp>mm [TCG]GGGG[AT] AGGGGC A !~ [TCG] G ~~ G G ~~ G G ~~ G G ~~ G C !~ [AT] D:\temp>mm [TCG]GGGG[AT] TGGGGC T ~~ [TCG] G ~~ G G ~~ G G ~~ G G ~~ G C !~ [AT]` [download] Using Algorithm::Loops just for the convenience to iterate over two lists at once without distracting from the main point.	[reply] [d/l] [select]
Re: Mismatch Positions of Ambiguous String by hv (Prior) on Jan 25, 2006 at 10:15 UTC
I was going to recommend String::Approx, but it turns out that it doesn't yet support regexp patterns. Here's a simple approach to doing it yourself: `sub mismatches { my($source, $target) = @_; my @sparts = ($source =~ /(\[.*?\]\|.)/g); scalar grep substr($target, $_, 1) !~ /^$sparts[$_]/, 0 .. $#spart +s; }` [download] This assumes source and target are the same length; if you're going to be searching for the closest match within a longer target you'll need a bit more work, and if you need decent performance on very large strings you'll almost certainly want to code it in C/XS. Hugo	[reply] [d/l]
Re^2: Mismatch Positions of Ambiguous String by monkfan (Curate) on Apr 26, 2006 at 09:23 UTC
Dear hv, Sorry for having have to come back to you again. How can I extend your code above, such that it can handle the target which is also ambiguous like this: `my $source1 = '[TCG]GGGG[AT]'; # ambiguous my $target1 = 'AGGGG[CT]'; # No of mismatch = 1 only at position 1` [download] Also the size of source and target will be the same. I tried this: `sub mismatches { my($source, $target) = @_; my @sparts = ($source =~ /(\[.?\]\|.)/g); my @tparts = ($target =~ /(\[.?\]\|.)/g); scalar grep $tparts[$_] !~ /^$sparts[$_]/, 0 .. $#sparts; }` [download] But doesn't work. Where did I go wrong? Regards, Edward	[reply] [d/l] [select]
Re^3: Mismatch Positions of Ambiguous String by hv (Prior) on Apr 27, 2006 at 10:16 UTC
But doesn't work. Where did I go wrong? It isn't as simple as that: my original approach relied on treating each ambiguous `[ACGT]` in the source string as a regexp, but to do that it needs to know a) which is the fixed string and which the regexp, and b) that they are not both regexps. The simplest extension is as below, but this suffers further on speed - it will probably be too slow if you're dealing with strings a few thousand base pairs long: `sub mismatches { my($source, $target) = @_; my @sparts = ($source =~ /(\[.?\]\|.)/g); my @tparts = ($target =~ /(\[.?\]\|.)/g); scalar grep { my($s, $t) = ($sparts[$_], $tparts[$_]); $s !~ /\[/ ? ($s !~ /$t/) : $t !~ /\[/ ? ($t !~ /$s/) : !intersect($s, $t) } 0 .. $#sparts; } sub intersect { my($s, $t) = @_; my %seen = map +($_ => 1), $s =~ /[^\[\]]/g; scalar grep $seen{$_}, $t =~ /[^\[\]]/g; }` [download] This says: if source is not ambiguous, treat the corresponding fragment of the target as a regexp; else if the target is not ambiguous, treat the source fragment as a regexp; else check a full intersection of the two. If your strings only include ACGT, a more efficient approach would be to transform each string into a bit vector that sets a bit for each base that may be present: `my %bits = ('A' => 1, 'C' => 2, 'G' => 4, 'T' => 8); my $source1 = bitwise('[TCG]GGGG[AT]'); my $target1 = bitwise('AGGGG[CT]'); print mismatches($source, $target1), "\n"; sub mismatches { my($source, $target) = @_; ($source & $target) =~ tr/\0//; } sub bitwise { my $string = shift; join '', map { my $char = 0; $char \|= $bits{$_} for /[ACGT]/g; chr($char) } $string =~ /(\[.*?\]\|.)/g; }` [download] Once the strings are transformed into this bitwise representation, checking for mismatches is very fast even with long strings. Hugo	[reply] [d/l] [select]
Re: Mismatch Positions of Ambiguous String by prasadbabu (Prior) on Jan 25, 2006 at 10:17 UTC
Here is my try for both ambiguous and not ambiguous strings. But this can be still simplified. use strict; use warnings; my $source= '[TCG]GGGG[AT]'; # ambiguous my $target1 = 'AGGGGC'; # No of mismatch = 2 on position 1 and 6 my $target2 = 'TGGGGC'; # No of mismatch = 1 on position 6 only my @tar = split //, $target1; my @t = @tar; for my $i (1..($#t+1)) { my $a =''; if ($source !~ /^\[/) #not ambiguous { ($a) = $source =~ /^(.)/s; $source =~ s/^.//s; my $s = shift @tar; ($s eq $a) ? print "$s matches position $i in target\n" : prin +t "$s not matches position $i in target\n"; } else #ambiguous { ($a) = $source =~ /^\[([^\]])/s; my @a = split //, $a; $source =~ s/^([^\]])\]//s; my $s = shift @tar; (my @m = grep /$s/, @a) ? print "[@a] matches position $i in t +arget\n" : print "[@a] not matches position $i in target\n"; } } [download] Prasad	[reply] [d/l]
Re: Mismatch Positions of Ambiguous String by tweetiepooh (Hermit) on Jan 25, 2006 at 11:14 UTC
Maybe you need to think of the problem in reverse ie the target is the source and the source the target and see where that works. `target = [ACT]GCTATC[GC] source = TGCTATCA` [download] Then you can apply each character or group in the target as a regex to the source. The answer should be the same if as in examples they are the same length. If not then you may need to keep some form of pointer to the actual place in the sequence (this looks like gene/DNA sequences).	[reply] [d/l]
Re: Mismatch Positions of Ambiguous String by Samy_rio (Vicar) on Jan 25, 2006 at 10:52 UTC
Hi monkfan, If i understood your question correctly, it helps you. use strict; my %compare = ('[TCG]GGGG[AT]'=>['AGGGGC', 'TGGGGC'], 'TGGGGT'=>['AGGG +GC','AGGGGC']); for my $ky (keys (%compare)) { my @source = split /GGGG/, $ky; my @arr = @{$compare{$ky}}; print "\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"; print "\nSource:\t", $ky; print "\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"; for my $t (0..$#arr){ my @tar = split /GGGG/,$arr[$t]; print "\nTarget:\t", $arr[$t], "\n"; my $mis = 0; for my $po (0..$#source){ if ($tar[$po] !~ m/$source[$po]/){ $mis++; print "\nPosition for Mismatch : ",$po + 1 if ($po == +0); print "\nPosition for Mismatch : ",$po + 5 if ($po == +1); } } print "\nNo of Mismatch = $mis\n"; } print "\n+++++++++++++++++++++++++++++\n"; } __END__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Source: [TCG]GGGG[AT] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Target: AGGGGC Position for Mismatch : 1 Position for Mismatch : 6 No of Mismatch = 2 Target: TGGGGC Position for Mismatch : 6 No of Mismatch = 1 +++++++++++++++++++++++++++++ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Source: TGGGGT ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Target: AGGGGC Position for Mismatch : 1 Position for Mismatch : 6 No of Mismatch = 2 Target: AGGGGC Position for Mismatch : 1 Position for Mismatch : 6 No of Mismatch = 2 +++++++++++++++++++++++++++++ [download] Regards, Velusamy R. eval"print uc\"\\c$_\""for split'','j)@,/6%@0%2,`e@3!-9v2)/@\|6%,53!-9@2~j';	[reply] [d/l] [select]
Re: Mismatch Positions of Ambiguous String by monkey_boy (Priest) on Jan 25, 2006 at 09:56 UTC
As a first try, i would just permutate your ambiguose string producing all possible strings, then do your bitwise comparrison of all source strings vs the target string and take the lowest resulting score, this may not scale well , depending on the size of your ambigous string. This is not a Signature...	[reply] [d/l]
Re: Mismatch Positions of Ambiguous String by murugu (Curate) on Jan 25, 2006 at 15:45 UTC
Hi, Here is my substring approach. `use strict; use warnings; my $str = 'TTCTTCT'; my $tar = 'ATGTTCT'; my $i =0; while (length($str)) { if (substr($str,0,1) eq '[') { my $index = index $str,']'; my $m = substr($str,0,$index+1); print "Mismatch at index $i\n" unless (substr($tar,$i,1)=~/$m/) +; substr($str,0,$index+1)=""; } else { print "Mismatch at index $i\n" unless (substr($str,0,1) eq subs +tr($tar,$i,1)); substr($str,0,1)=""; } $i++; }` [download] Regards, Murugesan Kandasamy use perl for(;;);	[reply] [d/l]
Re: Mismatch Positions of Ambiguous String by Roy Johnson (Monsignor) on Jan 25, 2006 at 19:14 UTC
A little different way: use strict; use warnings; my $source1 = '[TCG]GGGG[AT]'; # ambiguous my $target1 = 'AGGGGC'; # No of mismatch = 2 on position 1 and 6 my $target2 = 'TGGGGC'; # No of mismatch = 1 on position 6 only # Turn the source into a regexp that accepts any character at each # position, but only captures the matches (my $sourcepat = $source1) =~ s/(\[.?\]\|.)/(?:($1)\|.)/g; for ($target1, $target2) { # Substitute _ for undef in the matches returned my @result = map defined()?$_:'_', /$sourcepat/; print "$_: @result\n"; } __DATA__ Output is: AGGGGC: _ G G G G _ TGGGGC: T G G G G _ [download] Caution:* Contents may have been coded under pressure.	[reply] [d/l]


good chemistry is complicated, and a little bit messy -LW
	PerlMonks