heidi has asked for the wisdom of the Perl Monks concerning the following question:
$one= "1 AGCTGATCGAGCTAGTACCCTAGCTC 26"
$two= "15 AGCTGATCGAGCTAGTACCCTATCTC 40"
tried finding the mismatch here and its position. The mismatch over here is G and T in position 23 and 37 respectively.used foreach loop and matched each element. Not succeeded :(
can anyone help with simpler logic?
Re: match and mismatch
by GrandFather (Saint) on Nov 14, 2008 at 08:18 UTC
|
There is a neat trick for finding difference between similar equal length ASCII strings: xor them together and any non-zero bytes are different. Consider:
use strict;
use warnings;
my $one= "AGCTGATCGAGCTAGTACCCTAGCTC";
my $two= "AGCTGATCGAGCTAGTACCCTATCTC";
my $diff = $one ^ $two;
$diff =~ tr/\0/x/c;
my $start = -1;
while (-1 < ($start = index $diff, 'x', ++$start)) {
print "Difference at $start\n";
}
Prints:
Difference at 22
The tr/// changes non-zero bytes to x. The while loop then uses index to search through the difference string for the x bytes and reports their index (0 based position).
Perl reduces RSI - it saves typing
| [reply] [d/l] [select] |
|
use strict;
use warnings;
my $one = "AGCTGATCGAGCTAGTACCCTAGCTC";
my $two = "AGATGATCGAGCTAGTACCCTATCTC";
# Diffs here ^ ^
my $diff = $one ^ $two;
my @posns = ();
push @posns, pos $diff while $diff =~ m{[^\0]}g;
print qq{Differences, 1-based counting, at - @posns\n};
@posns = ();
push @posns, pos $diff while $diff =~ m{(?=[^\0])}g;
print qq{Differences, 0-based counting, at - @posns\n};
The output.
Differences, 1-based counting, at - 3 23
Differences, 0-based counting, at - 2 22
I hope this is of interest.
Cheers, JohnGG | [reply] [d/l] [select] |
|
i will write the complete problem i am facing. Here is the input file i am using.
sxoght: #query hit score probability qstart qend qorien
+tation tstart tend matches mismatches gapOpening gap
+s
@SNPSTER4_104_308EFAAXX:1:1:1694:128
GGGATAAGAGAGGTGCATGTTGGTATTTAAGGTAGT
1 alignment(s) -- reports limited to 10 alignment(s)
sxoght: SNPSTER4_104_308EFAAXX:1:1:1694:128 gi|122939163|ref|NM_00
+0165.3| -10 1.000000 1 36 + 1595 163
+0 35 1 00
Score = -10, P(A|R) = 1.000000
Query: 1 GGGATAAGAGAGGTGCATGTTGGTATTTAAGGTAGT 36
|||||||||||||||||||||||||||||| |||||
Sbjct: 1595 GGGATAAGAGAGGTGCATGTTGGTATTTAAAGTAGT 1630
@SNPSTER4_104_308EFAAXX:1:1:1608:94
GCAGTTTTAAGTTATTAGTTTTTAAAATCAGTACTT
14 alignment(s) -- reports limited to 10 alignment(s)
sxoght: SNPSTER4_104_308EFAAXX:1:1:1608:94 gi|113412254|ref|XR_01
+8775.1| 0 0.090884 1 36 + 1578 161
+3 36 0 00
Score = 0, P(A|R) = 0.090884
Query: 1 GCAGTTTTAAGTTATTAGTTTTTAAAATCAGTACTT 36
||| |||||||||||||||||||||||||||||| |
Sbjct: 1578 GCATTTTTAAGTTATTAGTTTTTAAAATCAGTACGT 1613
this is a big file, though the whole file looks like this.
What i am trying to do exactly is to grep the header (sxoght) for display in columns and also to display where there is a mismatch in the alignment between query and sbjct. for this input file, the expected results should look like:
>gi|122939163|ref|NM_000165.3| 1595 1630 SNPSTER4_104_308EFAA
+XX:1:1:1694:128 1 36 36 -10 1 1.000000 35
mismatch : 1625.GA
>gi|113412254|ref|XR_018775.1| 1578 1613 SNPSTER4_104_308EFAA
+XX:1:1:1608:94 1 36 36 0 1 0.090884 36
mismatch : 1581.GT 1612.TG
the code which i have written is :
#!/usr/bin/perl
open(FILE,"align.input") or die "can not open file";
while($var=<FILE>){
$str1=();$str2=();
if($var=~/^sxoght:/){
@ar=split(/\s+/,$var);
print ">$ar[2]\t$ar[8]\t$ar[9]\t$ar[1]\t$ar[5]\t$ar[10]\t$ar[6]\t$
+ar[3]\t$ar[4]\t$ar[11]\n";
}
if($var=~/^Query:/){
$str1=$var;
$str1=~s/^Query:\s+//g;
$str1=~s/\d+\s+//g;
$str1=~s/\s+//g;
}
if($var=~/^Sbjct:/){
$str2=$var;
$str2=~s/^Sbjct:\s+//g;
$str2=~s/\d+\s+//g;
$str2=~s/\s+//g;
}
for($i=0;$i<=length($str1);$i++)
{
if(substr($str1,$i,1) ne substr($str2,$i,1)){
print substr($str1,$i,1);
print substr($str2,$i,1);
print "$i\n";
}
}
}
I am not able to use "strict and warning" because using it doesnt allow me to access the scalar variable outside the loop. In my code, i m trying to extract the positions first, so that i will subtract it from the already stored @arr values of beginning and start. I am having problems with the for loop. I know where i am going wrong, but dont know how to correct it. PLEASEEEE HELP !!! | [reply] [d/l] [select] |
|
#!/usr/bin/perl
open(FILE,"align.input") or die "can not open file: $!";
while($var=<FILE>){
if($var=~/^sxoght:/){
($str1,$str2)=();
@ar=split(/\s+/,$var);
print ">$ar[2]\t$ar[8]\t$ar[9]\t$ar[1]\t$ar[5]\t$ar[10]\t$ar[6
+]\t$ar[3]\t$ar[4]\t$ar[11]\n";
}
if($var=~/^Query:/){
$str1=$var;
$str1=~s/^Query:\s+//g;
$str1=~s/\d+\s+//g;
$str1=~s/\s+//g;
}
if($var=~/^Sbjct:/){
$str2=$var;
$str2=~s/^Sbjct:\s+//g;
$str2=~s/\d+\s+//g;
$str2=~s/\s+//g;
}
if(defined $str1 and defined $str2) {
for($i=0;$i<=length($str1);$i++)
{
if(substr($str1,$i,1) ne substr($str2,$i,1)){
# this is not in the desired format, yet
print substr($str1,$i,1);
print substr($str2,$i,1);
print "$i\n";
}
}
($str1,$str2)=();
}
}
For the format, I propose to store the results in an array, and print "mismatch:" only if the array isn't empty at the end.
my @mismatch;
for($i=0;$i<=length($str1);$i++)
{
if(substr($str1,$i,1) ne substr($str2,$i,1)){
push @mismatch, "$i." . substr($str1,$i,1) . substr($s
+tr2,$i,1);
}
}
if(@mismatch) {
print "mismatch: @mismatch\n";
}
After that modification, the output I get for this file is
>hit tstart tend #query qstart matches qend score
+ probability mismatches
>gi|122939163|ref|NM_000165.3| 1595 1630 SNPSTER4_104_308EFAA
+XX:1:1:1694:128 1 35 36 -10 1.000000 1
mismatch: 30.GA
>gi|113412254|ref|XR_018775.1| 1578 1613 SNPSTER4_104_308EFAA
+XX:1:1:1608:94 1 36 36 0 0.090884 0
mismatch: 3.GT 34.TG
p.s. There's a possible speed improvement if you XOR (^) the two strings, you'll get a string of null bytes for where they are the same and non null where they are not:
my $xor = $str1 ^ $str2;
while($xor =~ /[^\0]/g) {
my $i = pos($xor) - 1; # or: $-[0]
push @mismatch, "$i." . substr($str1,$i,1) . substr($str2,
+$i,1);
}
| [reply] [d/l] [select] |
|
|
but u did not define here @var and @ar
| [reply] |
Re: match and mismatch
by ccn (Vicar) on Nov 14, 2008 at 08:15 UTC
|
my $one = "1 AGCTGATCGAGCTAGTACCCTAGCTC 26";
my $two = "15 AGCTGATCGAGCTAGTACCCTATCTC 40";
my @one = split '\s+', $one;
my @two = split '\s+', $two;
if ($one[1] ne $two[1]) {
my $matchlen = ($one[1] ^ $two[1]) =~ /\A(\0+)/
? length($1)
: 0;
print "Mismatch at ",
$one[0] + $matchlen, ' and ',
$two[0] + $matchlen, "\n";
}
else {
print "full match\n";
}
see Bitwise-String-Operators | [reply] [d/l] |
Re: match and mismatch
by artist (Parson) on Nov 14, 2008 at 09:38 UTC
|
use Algorithm::Diff qw(diff sdiff LCS traverse_sequences);
+
my @seq1= split //,"1 AGCTGATCGAGCTAGTACCCTAGCTC 26";
my @seq2 =split //,"15 AGCTGATCGAGCTAGTACCCTATCTC 40";
my @diffs = diff( \@seq1, \@seq2 );
foreach (@diffs) {
foreach (@{$_}) {
print join " " => @{$_},"\n";
}
print "\n";
}
And here is the output which shows the difference at each position if any.
So on '25th' position, you have 'G' in one, and 'T' in the other.
+ 1 5
- 2
- 25 G
+ 25 T
- 30 2
+ 30 4
- 31 6
+ 31 0
| [reply] [d/l] [select] |
|
Just for the sake of completeness: The runtime here is N^2. The identifcation of mismatches is obviously possible in N (see solutions above). But this solution is much more powerful and can identify insertions and deletions as well.
| [reply] |
Re: match and mismatch
by apl (Monsignor) on Nov 14, 2008 at 12:47 UTC
|
If you show us the code you tried, we could show you where the mistake was. | [reply] |
|
sure, i will write the complete problem i am facing. Here is the input file i am using.
sxoght: #query hit score probability qstart qend qorien
+tation tstart tend matches mismatches gapOpening gap
+s
@SNPSTER4_104_308EFAAXX:1:1:1694:128
GGGATAAGAGAGGTGCATGTTGGTATTTAAGGTAGT
1 alignment(s) -- reports limited to 10 alignment(s)
sxoght: SNPSTER4_104_308EFAAXX:1:1:1694:128 gi|122939163|ref|NM_00
+0165.3| -10 1.000000 1 36 + 1595 163
+0 35 1 00
Score = -10, P(A|R) = 1.000000
Query: 1 GGGATAAGAGAGGTGCATGTTGGTATTTAAGGTAGT 36
|||||||||||||||||||||||||||||| |||||
Sbjct: 1595 GGGATAAGAGAGGTGCATGTTGGTATTTAAAGTAGT 1630
@SNPSTER4_104_308EFAAXX:1:1:1608:94
GCAGTTTTAAGTTATTAGTTTTTAAAATCAGTACTT
14 alignment(s) -- reports limited to 10 alignment(s)
sxoght: SNPSTER4_104_308EFAAXX:1:1:1608:94 gi|113412254|ref|XR_01
+8775.1| 0 0.090884 1 36 + 1578 161
+3 36 0 00
Score = 0, P(A|R) = 0.090884
Query: 1 GCAGTTTTAAGTTATTAGTTTTTAAAATCAGTACTT 36
||| |||||||||||||||||||||||||||||| |
Sbjct: 1578 GCATTTTTAAGTTATTAGTTTTTAAAATCAGTACGT 1613
this is a big file, though the whole file looks like this.
What i am trying to do exactly is to grep the header (sxoght) for display in columns and also to display where there is a mismatch in the alignment between query and sbjct. for this input file, the expected results should look like:
>gi|122939163|ref|NM_000165.3| 1595 1630 SNPSTER4_104_308EFAA
+XX:1:1:1694:128 1 36 36 -10 1 1.000000 35
mismatch : 1625.GA
>gi|113412254|ref|XR_018775.1| 1578 1613 SNPSTER4_104_308EFAA
+XX:1:1:1608:94 1 36 36 0 1 0.090884 36
mismatch : 1581.GT 1612.TG
the code which i have written is :
#!/usr/bin/perl
open(FILE,"align.input") or die "can not open file";
while($var=<FILE>){
$str1=();$str2=();
if($var=~/^sxoght:/){
@ar=split(/\s+/,$var);
print ">$ar[2]\t$ar[8]\t$ar[9]\t$ar[1]\t$ar[5]\t$ar[10]\t$ar[6]\t$
+ar[3]\t$ar[4]\t$ar[11]\n";
}
if($var=~/^Query:/){
$str1=$var;
$str1=~s/^Query:\s+//g;
$str1=~s/\d+\s+//g;
$str1=~s/\s+//g;
}
if($var=~/^Sbjct:/){
$str2=$var;
$str2=~s/^Sbjct:\s+//g;
$str2=~s/\d+\s+//g;
$str2=~s/\s+//g;
}
for($i=0;$i<=length($str1);$i++)
{
if(substr($str1,$i,1) ne substr($str2,$i,1)){
print substr($str1,$i,1);
print substr($str2,$i,1);
print "$i\n";
}
}
}
I am not able to use "strict and warning" because using it doesnt allow me to access the scalar variable outside the loop. In my code, i m trying to extract the positions first, so that i will subtract it from the already stored @arr values of beginning and start. I am having problems with the for loop. I know where i am going wrong, but dont know how to correct it. PLEASEEEE HELP !!! | [reply] [d/l] [select] |
|
| [reply] [d/l] [select] |
|
|