Re: Mismatch Positions of Ambiguous String

I was going to recommend String::Approx, but it turns out that it doesn't yet support regexp patterns. Here's a simple approach to doing it yourself:

  sub mismatches {
    my($source, $target) = @_;
    my @sparts = ($source =~ /(\[.*?\]|.)/g);
    scalar grep substr($target, $_, 1) !~ /^$sparts[$_]/, 0 .. $#spart
+s;
  }
[download]

This assumes source and target are the same length; if you're going to be searching for the closest match within a longer target you'll need a bit more work, and if you need decent performance on very large strings you'll almost certainly want to code it in C/XS.

Hugo

Comment on Re: Mismatch Positions of Ambiguous String Download Code

Replies are listed 'Best First'.
Re^2: Mismatch Positions of Ambiguous String by monkfan (Curate) on Apr 26, 2006 at 09:23 UTC
Dear hv, Sorry for having have to come back to you again. How can I extend your code above, such that it can handle the target which is also ambiguous like this: `my $source1 = '[TCG]GGGG[AT]'; # ambiguous my $target1 = 'AGGGG[CT]'; # No of mismatch = 1 only at position 1` [download] Also the size of source and target will be the same. I tried this: `sub mismatches { my($source, $target) = @_; my @sparts = ($source =~ /(\[.?\]\|.)/g); my @tparts = ($target =~ /(\[.?\]\|.)/g); scalar grep $tparts[$_] !~ /^$sparts[$_]/, 0 .. $#sparts; }` [download] But doesn't work. Where did I go wrong? Regards, Edward	[reply] [d/l] [select]
Re^3: Mismatch Positions of Ambiguous String by hv (Prior) on Apr 27, 2006 at 10:16 UTC
But doesn't work. Where did I go wrong? It isn't as simple as that: my original approach relied on treating each ambiguous `[ACGT]` in the source string as a regexp, but to do that it needs to know a) which is the fixed string and which the regexp, and b) that they are not both regexps. The simplest extension is as below, but this suffers further on speed - it will probably be too slow if you're dealing with strings a few thousand base pairs long: `sub mismatches { my($source, $target) = @_; my @sparts = ($source =~ /(\[.?\]\|.)/g); my @tparts = ($target =~ /(\[.?\]\|.)/g); scalar grep { my($s, $t) = ($sparts[$_], $tparts[$_]); $s !~ /\[/ ? ($s !~ /$t/) : $t !~ /\[/ ? ($t !~ /$s/) : !intersect($s, $t) } 0 .. $#sparts; } sub intersect { my($s, $t) = @_; my %seen = map +($_ => 1), $s =~ /[^\[\]]/g; scalar grep $seen{$_}, $t =~ /[^\[\]]/g; }` [download] This says: if source is not ambiguous, treat the corresponding fragment of the target as a regexp; else if the target is not ambiguous, treat the source fragment as a regexp; else check a full intersection of the two. If your strings only include ACGT, a more efficient approach would be to transform each string into a bit vector that sets a bit for each base that may be present: `my %bits = ('A' => 1, 'C' => 2, 'G' => 4, 'T' => 8); my $source1 = bitwise('[TCG]GGGG[AT]'); my $target1 = bitwise('AGGGG[CT]'); print mismatches($source, $target1), "\n"; sub mismatches { my($source, $target) = @_; ($source & $target) =~ tr/\0//; } sub bitwise { my $string = shift; join '', map { my $char = 0; $char \|= $bits{$_} for /[ACGT]/g; chr($char) } $string =~ /(\[.*?\]\|.)/g; }` [download] Once the strings are transformed into this bitwise representation, checking for mismatches is very fast even with long strings. Hugo	[reply] [d/l] [select]


No such thing as a small change
	PerlMonks