Comparing array of aligned sequences

newtoperlprog has asked for the wisdom of the Perl Monks concerning the following question:

Dear all, I am new to perl and trying to learn the various concepts related to the language. I am trying to parse a aligned dna sequences and printing when each position the alphabets are same. If there is a mismatch then it should skip that and print the next consnsus sequence in new line. Here I am posting the code and sample file. Any help will be greatly appreciated. Thank you all

#!/usr/bin/perl
use warnings;
use strict;
my $seqcount; my $pos; my $arrlen;
my @arr = ();
open (B, "temp.dat");
while (my $line=<B>)
{
    chomp $line;
    $seqcount++;
    $line =~ s/\s//g;
    my @temp = split (//, $line);
    $arrlen = scalar(@temp);
    for ($pos=0;$pos<=scalar(@temp);$pos++)
    {
        $arr[$seqcount][$pos] = $temp[$pos];
    }
}
my $max_position = 0;
$max_position = $arrlen if($arrlen > $max_position);
for ($pos=0;$pos<=$max_position;$pos++)
{
    for (my $s=1;$s<=$seqcount;$s++)
    {
        if ($arr[$s][$pos] ne $arr[$seqcount][$pos])
        {
            print "\n";            
            next;
        }
        else
        {
            print "$arr[$s][$pos]";
        }
    }
}
[download]

temp.dat
atggctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg
+aatcctgcggacgacccctctcgtggtcgcttggggctctgccgccctcttctccgcctgccgttccgg
+c
atagctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg
+aatcctgcggacgacccctctcgtggtcgtttggggctctgccgccctcttctccgcctgccgttcagg
+c
atggctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg
+aatcctgcggacgacccctctcgtggtcgcttggggctctgccgccctcttctccgcctgccgttccgg
+c
atggctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg
+aatcctgcggacgacccctctcgtggtcgcttggggctctaccgccctcttctccgcctgccgttccgg
+c
[download]

Desired output
at
gctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctgaat
+cctgcggacgacccctctcgtggtcg
ttggggctct
ccgccctcttctccgcctgccgttc
ggc
[download]

Comment on Comparing array of aligned sequences Select or Download Code

Replies are listed 'Best First'.
Re: Comparing array of aligned sequences by toolic (Bishop) on Jul 11, 2014 at 16:26 UTC
Not being a bio guy, I don't understand your transfer function. However, when I run your code I get several `Use of uninitialized value in string`. Using Tip #4 from the Basic debugging checklist (Data::Dumper), I see some `undef` values in your array. Try to eliminate those first. `use Data::Dumper; print Dumper(\@arr)` [download] Arrays in Perl start with index=0 by default, but you start filling the array at index=1. Maybe you want `$seqcount++;` at the end of your while loop? If you just want your input file in an array-of-arrays (perldoc perdsc), this is simpler: `my @arr; open (B, "temp.dat"); while (my $line=<B>) { chomp $line; $line =~ s/\s//g; my @temp = split //, $line; push @arr, [@temp]; }` [download]	[reply] [d/l] [select]
Re^2: Comparing array of aligned sequences by newtoperlprog (Sexton) on Jul 11, 2014 at 17:05 UTC
Thank you for your time and reply. I did some modifcations and now the program is printing 3 times each value on match. I guess I am not putting the print at the right place. #!/usr/bin/perl use warnings; use strict; use Data::Dumper; my $seqcount; my $pos; my $arrlen; my @arr = (); open (B, "temp.dat"); while (my $line=<B>) { chomp $line; $seqcount++; $line =~ s/\s//g; my @temp = split (//, $line); $arrlen = scalar(@temp); for ($pos=0;$pos<=$#temp;$pos++) { $arr[$seqcount][$pos] = $temp[$pos]; # print "$pos\t$arr[$seqcount][$pos]\n"; } } my $max_position = 0; $max_position = $arrlen if($arrlen > $max_position); for ($pos=0;$pos<=$max_position;$pos++) { for (my $s=1;$s<=$seqcount;$s++) { # print "$pos\t$s\t$arr[$s][$pos]\n"; if ($arr[$s][$pos] ne $arr[$seqcount][$pos]) { print "\n"; next; } else { print "$arr[$s][$pos]"; } } print "\n"; } [download] Do I need to push in another array after there is a match of values? Thanks for your help.	[reply] [d/l]
Re: Comparing array of aligned sequences by johngg (Canon) on Jul 11, 2014 at 16:55 UTC
If I've understood correctly I think one approach would break your sequences down into arrays of single letters and shift and compare a letter from each at a time. use strict; use warnings; use 5.014; open my $inFH, q{<}, \ <<EOD or die $!; atggctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgcttggggctctgccgccctcttctccgcctgccgttccgg +c atagctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgtttggggctctgccgccctcttctccgcctgccgttcagg +c atggctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgcttggggctctgccgccctcttctccgcctgccgttccgg +c atggctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgcttggggctctaccgccctcttctccgcctgccgttccgg +c EOD chomp( my @sequences = <$inFH> ); close $inFH or die $!; my @letters = map { [ split m{} ] } @sequences; my $consensus = q{}; while ( @{ $letters[ 0 ] } ) { my $seq0 = shift @{ $letters[ 0 ] }; my $seq1 = shift @{ $letters[ 1 ] }; my $seq2 = shift @{ $letters[ 2 ] }; my $seq3 = shift @{ $letters[ 3 ] }; if ( $seq0 eq $seq1 and $seq0 eq $seq2 and $seq0 eq $seq3 ) { $consensus .= $seq0; } else { emitConsensus(); } } emitConsensus(); sub emitConsensus { say $consensus if length $consensus; $consensus = q{}; } [download] The output. `at gctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctgaat +cctgcggacgacccctctcgtggtcg ttggggctct ccgccctcttctccgcctgccgttc ggc` [download] I hope I have understood correctly and this is helpful. Cheers, JohnGG	[reply] [d/l] [select]
Re^2: Comparing array of aligned sequences by newtoperlprog (Sexton) on Jul 11, 2014 at 17:35 UTC
Dear JohnGG, Thank you for your reply and suggestions. The sample data which i provided only has 4 sequences whereas in reality i have files which have greater than 25,000 sequences. In that situation I guess I might have to use loops to do the checking. Kindly let me know if I am thinking in the right direction.	[reply]
Re^3: Comparing array of aligned sequences by johngg (Canon) on Jul 11, 2014 at 21:19 UTC
Given the number of sequence lines in your files processing line by line would be the better option. Reading the first sequence into a string that we will compare with subsequent sequences one at a time, modifying the string as we go would be the approach I'd now adopt. The `flagDiffs()` subroutine takes copies of the base sequence string and the next sequence read from file as arguments and `XOR`s them. The resultant string will have `\0` (NULL) characters wherever characters in the two sequences matched. It then uses a regular expression and pos to find non-NULL characters, i.e. non-matches. Finally it modifies the base sequence string by substituting an 'X' at the positions where there were non-matches and returns it, the returned sttring being assigned back to the base sequence string. Here is a command-line example of XOR'ing two strings to demonstrate the process. `$ perl -E ' $s1 = q{gctgc}; $s2 = q{gctcc}; $x = $s1 ^ $s2; say for map sprintf( q{0x%02x}, ord ), split m{}, $x;' 0x00 0x00 0x00 0x04 0x00 $` [download] Once all lines have been processed the base sequence string can be split on one or more 'X' characters to find the consensus strings. use strict; use warnings; use 5.014; open my $inFH, q{<}, \ <<EOD or die $!; atggctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgcttggggctctgccgccctcttctccgcctgccgttccgg +c atagctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgtttggggctctgccgccctcttctccgcctgccgttcagg +c atggctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgcttggggctctgccgccctcttctccgcctgccgttccgg +c atggctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgcttggggctctaccgccctcttctccgcctgccgttccgg +c EOD my $baseStr = <$inFH>; chomp $baseStr; while ( <$inFH> ) { chomp; $baseStr = flagDiffs( $baseStr, $_ ); } close $inFH or die $!; say for split m{X+}, $baseStr; sub flagDiffs { my( $baseStr, $nextStr ) = @_; my $diff = $baseStr ^ $nextStr; my @posns; push @posns, pos $diff while $diff =~ m{(?=[^\0])}g; substr $baseStr, $_, 1, q{X} for @posns; return $baseStr; } [download] The output. `at gctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctgaat +cctgcggacgacccctctcgtggtcg ttggggctct ccgccctcttctccgcctgccgttc ggc` [download] Without a 25,000 sequence file to test on I don't know whether this approach will perform but it seems to give the expected results with the sample in your OP. I hope this is helpful. Update: Added the command-line XOR example. Update 2: There's no need to keep updating the base sequence string as we go, just keep a record, as keys of a hash, of where differences are found and modify it after all lines have been processed. Also there's no need to chomp each line (as long as all the sequences are the same length), just do it at the end to the base string. If sequences differ in length then you are into pre-processing to find the shortest or longest sequence then either truncating to the shortest or padding to the longest. New code, the output is the same. use strict; use warnings; use 5.014; open my $inFH, q{<}, \ <<EOD or die $!; atggctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgcttggggctctgccgccctcttctccgcctgccgttccgg +c atagctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgtttggggctctgccgccctcttctccgcctgccgttcagg +c atggctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgcttggggctctgccgccctcttctccgcctgccgttccgg +c atggctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgcttggggctctaccgccctcttctccgcctgccgttccgg +c EOD my $baseStr = <$inFH>; my %diffPosns; while ( <$inFH> ) { findDiffPosns( $baseStr, $_ ); } close $inFH or die $!; chomp $baseStr; substr $baseStr, $_, 1, q{X} for keys %diffPosns; say for split m{X+}, $baseStr; sub findDiffPosns { my( $baseStr, $nextStr ) = @_; my $diff = $baseStr ^ $nextStr; $diffPosns{ pos $diff } ++ while $diff =~ m{(?=[^\0])}g; } [download] Cheers, JohnGG	[reply] [d/l] [select]
Re^4: Comparing array of aligned sequences by newtoperlprog (Sexton) on Jul 14, 2014 at 21:26 UTC
Re^5: Comparing array of aligned sequences by johngg (Canon) on Jul 14, 2014 at 22:27 UTC
Some notes below your chosen depth have not been shown here