match and mismatch

heidi has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: match and mismatch by GrandFather (Saint) on Nov 14, 2008 at 08:18 UTC
There is a neat trick for finding difference between similar equal length ASCII strings: xor them together and any non-zero bytes are different. Consider: `use strict; use warnings; my $one= "AGCTGATCGAGCTAGTACCCTAGCTC"; my $two= "AGCTGATCGAGCTAGTACCCTATCTC"; my $diff = $one ^ $two; $diff =~ tr/\0/x/c; my $start = -1; while (-1 < ($start = index $diff, 'x', ++$start)) { print "Difference at $start\n"; }` [download] Prints: `Difference at 22` [download] The tr/// changes non-zero bytes to x. The while loop then uses index to search through the difference string for the x bytes and reports their index (0 based position). Perl reduces RSI - it saves typing	[reply] [d/l] [select]
Re^2: match and mismatch by johngg (Canon) on Nov 14, 2008 at 10:35 UTC
The tr/// changes non-zero bytes to x. The while loop then uses index to search through the difference string for the x bytes and reports their index (0 based position). I think it might be simpler to use a global regex match for non-nulls and pos. You can just match them for 1-based counting or put them in a look-ahead for 0-based counting. Note that I added another difference in the third characters to check that it was working as I had hoped. `use strict; use warnings; my $one = "AGCTGATCGAGCTAGTACCCTAGCTC"; my $two = "AGATGATCGAGCTAGTACCCTATCTC"; # Diffs here ^ ^ my $diff = $one ^ $two; my @posns = (); push @posns, pos $diff while $diff =~ m{[^\0]}g; print qq{Differences, 1-based counting, at - @posns\n}; @posns = (); push @posns, pos $diff while $diff =~ m{(?=[^\0])}g; print qq{Differences, 0-based counting, at - @posns\n};` [download] The output. `Differences, 1-based counting, at - 3 23 Differences, 0-based counting, at - 2 22` [download] I hope this is of interest. Cheers, JohnGG	[reply] [d/l] [select]
Re^2: match and mismatch by heidi (Sexton) on Nov 17, 2008 at 03:25 UTC
i will write the complete problem i am facing. Here is the input file i am using. sxoght: #query hit score probability qstart qend qorien +tation tstart tend matches mismatches gapOpening gap +s @SNPSTER4_104_308EFAAXX:1:1:1694:128 GGGATAAGAGAGGTGCATGTTGGTATTTAAGGTAGT 1 alignment(s) -- reports limited to 10 alignment(s) sxoght: SNPSTER4_104_308EFAAXX:1:1:1694:128 gi\|122939163\|ref\|NM_00 +0165.3\| -10 1.000000 1 36 + 1595 163 +0 35 1 00 Score = -10, P(A\|R) = 1.000000 Query: 1 GGGATAAGAGAGGTGCATGTTGGTATTTAAGGTAGT 36 \|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\| \|\|\|\|\| Sbjct: 1595 GGGATAAGAGAGGTGCATGTTGGTATTTAAAGTAGT 1630 @SNPSTER4_104_308EFAAXX:1:1:1608:94 GCAGTTTTAAGTTATTAGTTTTTAAAATCAGTACTT 14 alignment(s) -- reports limited to 10 alignment(s) sxoght: SNPSTER4_104_308EFAAXX:1:1:1608:94 gi\|113412254\|ref\|XR_01 +8775.1\| 0 0.090884 1 36 + 1578 161 +3 36 0 00 Score = 0, P(A\|R) = 0.090884 Query: 1 GCAGTTTTAAGTTATTAGTTTTTAAAATCAGTACTT 36 \|\|\| \|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\| \| Sbjct: 1578 GCATTTTTAAGTTATTAGTTTTTAAAATCAGTACGT 1613 [download] this is a big file, though the whole file looks like this. What i am trying to do exactly is to grep the header (sxoght) for display in columns and also to display where there is a mismatch in the alignment between query and sbjct. for this input file, the expected results should look like: `>gi\|122939163\|ref\|NM_000165.3\| 1595 1630 SNPSTER4_104_308EFAA +XX:1:1:1694:128 1 36 36 -10 1 1.000000 35 mismatch : 1625.GA >gi\|113412254\|ref\|XR_018775.1\| 1578 1613 SNPSTER4_104_308EFAA +XX:1:1:1608:94 1 36 36 0 1 0.090884 36 mismatch : 1581.GT 1612.TG` [download] the code which i have written is : #!/usr/bin/perl open(FILE,"align.input") or die "can not open file"; while($var=<FILE>){ $str1=();$str2=(); if($var=~/^sxoght:/){ @ar=split(/\s+/,$var); print ">$ar[2]\t$ar[8]\t$ar[9]\t$ar[1]\t$ar[5]\t$ar[10]\t$ar[6]\t$ +ar[3]\t$ar[4]\t$ar[11]\n"; } if($var=~/^Query:/){ $str1=$var; $str1=~s/^Query:\s+//g; $str1=~s/\d+\s+//g; $str1=~s/\s+//g; } if($var=~/^Sbjct:/){ $str2=$var; $str2=~s/^Sbjct:\s+//g; $str2=~s/\d+\s+//g; $str2=~s/\s+//g; } for($i=0;$i<=length($str1);$i++) { if(substr($str1,$i,1) ne substr($str2,$i,1)){ print substr($str1,$i,1); print substr($str2,$i,1); print "$i\n"; } } } [download] I am not able to use "strict and warning" because using it doesnt allow me to access the scalar variable outside the loop. In my code, i m trying to extract the positions first, so that i will subtract it from the already stored @arr values of beginning and start. I am having problems with the for loop. I know where i am going wrong, but dont know how to correct it. PLEASEEEE HELP !!!	[reply] [d/l] [select]
Re^3: match and mismatch by bart (Canon) on Nov 17, 2008 at 03:57 UTC
You should only print the output if $str1 and $str2 are both set, and then, clear them so you don't print them again. The next is very close to your code, but kind of works. You only need to format the output as desired (I don't know where the large numbers come from): #!/usr/bin/perl open(FILE,"align.input") or die "can not open file: $!"; while($var=<FILE>){ if($var=~/^sxoght:/){ ($str1,$str2)=(); @ar=split(/\s+/,$var); print ">$ar[2]\t$ar[8]\t$ar[9]\t$ar[1]\t$ar[5]\t$ar[10]\t$ar[6 +]\t$ar[3]\t$ar[4]\t$ar[11]\n"; } if($var=~/^Query:/){ $str1=$var; $str1=~s/^Query:\s+//g; $str1=~s/\d+\s+//g; $str1=~s/\s+//g; } if($var=~/^Sbjct:/){ $str2=$var; $str2=~s/^Sbjct:\s+//g; $str2=~s/\d+\s+//g; $str2=~s/\s+//g; } if(defined $str1 and defined $str2) { for($i=0;$i<=length($str1);$i++) { if(substr($str1,$i,1) ne substr($str2,$i,1)){ # this is not in the desired format, yet print substr($str1,$i,1); print substr($str2,$i,1); print "$i\n"; } } ($str1,$str2)=(); } } [download] For the format, I propose to store the results in an array, and print "mismatch:" only if the array isn't empty at the end. `my @mismatch; for($i=0;$i<=length($str1);$i++) { if(substr($str1,$i,1) ne substr($str2,$i,1)){ push @mismatch, "$i." . substr($str1,$i,1) . substr($s +tr2,$i,1); } } if(@mismatch) { print "mismatch: @mismatch\n"; }` [download] After that modification, the output I get for this file is `>hit tstart tend #query qstart matches qend score + probability mismatches >gi\|122939163\|ref\|NM_000165.3\| 1595 1630 SNPSTER4_104_308EFAA +XX:1:1:1694:128 1 35 36 -10 1.000000 1 mismatch: 30.GA >gi\|113412254\|ref\|XR_018775.1\| 1578 1613 SNPSTER4_104_308EFAA +XX:1:1:1608:94 1 36 36 0 0.090884 0 mismatch: 3.GT 34.TG` [download] p.s. There's a possible speed improvement if you XOR (`^`) the two strings, you'll get a string of null bytes for where they are the same and non null where they are not: `my $xor = $str1 ^ $str2; while($xor =~ /[^\0]/g) { my $i = pos($xor) - 1; # or: $-[0] push @mismatch, "$i." . substr($str1,$i,1) . substr($str2, +$i,1); }` [download]	[reply] [d/l] [select]
Re^4: match and mismatch by heidi (Sexton) on Nov 17, 2008 at 04:58 UTC
Re^3: match and mismatch by Anonymous Monk on Feb 10, 2017 at 05:00 UTC
but u did not define here @var and @ar	[reply]
Re: match and mismatch by ccn (Vicar) on Nov 14, 2008 at 08:15 UTC
`my $one = "1 AGCTGATCGAGCTAGTACCCTAGCTC 26"; my $two = "15 AGCTGATCGAGCTAGTACCCTATCTC 40"; my @one = split '\s+', $one; my @two = split '\s+', $two; if ($one[1] ne $two[1]) { my $matchlen = ($one[1] ^ $two[1]) =~ /\A(\0+)/ ? length($1) : 0; print "Mismatch at ", $one[0] + $matchlen, ' and ', $two[0] + $matchlen, "\n"; } else { print "full match\n"; }` [download] see Bitwise-String-Operators	[reply] [d/l]
Re: match and mismatch by artist (Parson) on Nov 14, 2008 at 09:38 UTC
`use Algorithm::Diff qw(diff sdiff LCS traverse_sequences); + my @seq1= split //,"1 AGCTGATCGAGCTAGTACCCTAGCTC 26"; my @seq2 =split //,"15 AGCTGATCGAGCTAGTACCCTATCTC 40"; my @diffs = diff( \@seq1, \@seq2 ); foreach (@diffs) { foreach (@{$_}) { print join " " => @{$_},"\n"; } print "\n"; }` [download] And here is the output which shows the difference at each position if any. So on '25th' position, you have 'G' in one, and 'T' in the other. `+ 1 5 - 2 - 25 G + 25 T - 30 2 + 30 4 - 31 6 + 31 0` [download] --Artist	[reply] [d/l] [select]
Re^2: match and mismatch by lima1 (Curate) on Nov 14, 2008 at 17:49 UTC
Just for the sake of completeness: The runtime here is N^2. The identifcation of mismatches is obviously possible in N (see solutions above). But this solution is much more powerful and can identify insertions and deletions as well.	[reply]
Re: match and mismatch by apl (Monsignor) on Nov 14, 2008 at 12:47 UTC
If you show us the code you tried, we could show you where the mistake was.	[reply]
Re^2: match and mismatch by heidi (Sexton) on Nov 17, 2008 at 02:06 UTC
sure, i will write the complete problem i am facing. Here is the input file i am using. sxoght: #query hit score probability qstart qend qorien +tation tstart tend matches mismatches gapOpening gap +s @SNPSTER4_104_308EFAAXX:1:1:1694:128 GGGATAAGAGAGGTGCATGTTGGTATTTAAGGTAGT 1 alignment(s) -- reports limited to 10 alignment(s) sxoght: SNPSTER4_104_308EFAAXX:1:1:1694:128 gi\|122939163\|ref\|NM_00 +0165.3\| -10 1.000000 1 36 + 1595 163 +0 35 1 00 Score = -10, P(A\|R) = 1.000000 Query: 1 GGGATAAGAGAGGTGCATGTTGGTATTTAAGGTAGT 36 \|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\| \|\|\|\|\| Sbjct: 1595 GGGATAAGAGAGGTGCATGTTGGTATTTAAAGTAGT 1630 @SNPSTER4_104_308EFAAXX:1:1:1608:94 GCAGTTTTAAGTTATTAGTTTTTAAAATCAGTACTT 14 alignment(s) -- reports limited to 10 alignment(s) sxoght: SNPSTER4_104_308EFAAXX:1:1:1608:94 gi\|113412254\|ref\|XR_01 +8775.1\| 0 0.090884 1 36 + 1578 161 +3 36 0 00 Score = 0, P(A\|R) = 0.090884 Query: 1 GCAGTTTTAAGTTATTAGTTTTTAAAATCAGTACTT 36 \|\|\| \|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\| \| Sbjct: 1578 GCATTTTTAAGTTATTAGTTTTTAAAATCAGTACGT 1613 [download] this is a big file, though the whole file looks like this. What i am trying to do exactly is to grep the header (sxoght) for display in columns and also to display where there is a mismatch in the alignment between query and sbjct. for this input file, the expected results should look like: `>gi\|122939163\|ref\|NM_000165.3\| 1595 1630 SNPSTER4_104_308EFAA +XX:1:1:1694:128 1 36 36 -10 1 1.000000 35 mismatch : 1625.GA >gi\|113412254\|ref\|XR_018775.1\| 1578 1613 SNPSTER4_104_308EFAA +XX:1:1:1608:94 1 36 36 0 1 0.090884 36 mismatch : 1581.GT 1612.TG` [download] the code which i have written is : #!/usr/bin/perl open(FILE,"align.input") or die "can not open file"; while($var=<FILE>){ $str1=();$str2=(); if($var=~/^sxoght:/){ @ar=split(/\s+/,$var); print ">$ar[2]\t$ar[8]\t$ar[9]\t$ar[1]\t$ar[5]\t$ar[10]\t$ar[6]\t$ +ar[3]\t$ar[4]\t$ar[11]\n"; } if($var=~/^Query:/){ $str1=$var; $str1=~s/^Query:\s+//g; $str1=~s/\d+\s+//g; $str1=~s/\s+//g; } if($var=~/^Sbjct:/){ $str2=$var; $str2=~s/^Sbjct:\s+//g; $str2=~s/\d+\s+//g; $str2=~s/\s+//g; } for($i=0;$i<=length($str1);$i++) { if(substr($str1,$i,1) ne substr($str2,$i,1)){ print substr($str1,$i,1); print substr($str2,$i,1); print "$i\n"; } } } [download] I am not able to use "strict and warning" because using it doesnt allow me to access the scalar variable outside the loop. In my code, i m trying to extract the positions first, so that i will subtract it from the already stored @arr values of beginning and start. I am having problems with the for loop. I know where i am going wrong, but dont know how to correct it. PLEASEEEE HELP !!!	[reply] [d/l] [select]
Re^3: match and mismatch by apl (Monsignor) on Nov 20, 2008 at 13:23 UTC
I don't believe this is the complete program, because I don't see where your variables are being defined. If the `my` statements are inside a loop, by definition you can't see the variable outside the loop (and `use strict; use warnings;` are correcly telling you so.) So, please show the program you are trying to run, and we'll try this again.	[reply] [d/l] [select]


Perl-Sensitive Sunglasses
	PerlMonks