Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number
 
PerlMonks  

Comparing array of aligned sequences

by newtoperlprog (Sexton)
on Jul 11, 2014 at 16:06 UTC ( [id://1093245]=perlquestion: print w/replies, xml ) Need Help??

newtoperlprog has asked for the wisdom of the Perl Monks concerning the following question:

Dear all, I am new to perl and trying to learn the various concepts related to the language. I am trying to parse a aligned dna sequences and printing when each position the alphabets are same. If there is a mismatch then it should skip that and print the next consnsus sequence in new line. Here I am posting the code and sample file. Any help will be greatly appreciated. Thank you all

#!/usr/bin/perl use warnings; use strict; my $seqcount; my $pos; my $arrlen; my @arr = (); open (B, "temp.dat"); while (my $line=<B>) { chomp $line; $seqcount++; $line =~ s/\s//g; my @temp = split (//, $line); $arrlen = scalar(@temp); for ($pos=0;$pos<=scalar(@temp);$pos++) { $arr[$seqcount][$pos] = $temp[$pos]; } } my $max_position = 0; $max_position = $arrlen if($arrlen > $max_position); for ($pos=0;$pos<=$max_position;$pos++) { for (my $s=1;$s<=$seqcount;$s++) { if ($arr[$s][$pos] ne $arr[$seqcount][$pos]) { print "\n"; next; } else { print "$arr[$s][$pos]"; } } }
temp.dat atggctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgcttggggctctgccgccctcttctccgcctgccgttccgg +c atagctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgtttggggctctgccgccctcttctccgcctgccgttcagg +c atggctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgcttggggctctgccgccctcttctccgcctgccgttccgg +c atggctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgcttggggctctaccgccctcttctccgcctgccgttccgg +c
Desired output at gctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctgaat +cctgcggacgacccctctcgtggtcg ttggggctct ccgccctcttctccgcctgccgttc ggc

Replies are listed 'Best First'.
Re: Comparing array of aligned sequences
by toolic (Bishop) on Jul 11, 2014 at 16:26 UTC
    Not being a bio guy, I don't understand your transfer function. However, when I run your code I get several Use of uninitialized value in string. Using Tip #4 from the Basic debugging checklist (Data::Dumper), I see some undef values in your array. Try to eliminate those first.
    use Data::Dumper; print Dumper(\@arr)

    Arrays in Perl start with index=0 by default, but you start filling the array at index=1. Maybe you want $seqcount++; at the end of your while loop?

    If you just want your input file in an array-of-arrays (perldoc perdsc), this is simpler:

    my @arr; open (B, "temp.dat"); while (my $line=<B>) { chomp $line; $line =~ s/\s//g; my @temp = split //, $line; push @arr, [@temp]; }

      Thank you for your time and reply. I did some modifcations and now the program is printing 3 times each value on match. I guess I am not putting the print at the right place.

      #!/usr/bin/perl use warnings; use strict; use Data::Dumper; my $seqcount; my $pos; my $arrlen; my @arr = (); open (B, "temp.dat"); while (my $line=<B>) { chomp $line; $seqcount++; $line =~ s/\s//g; my @temp = split (//, $line); $arrlen = scalar(@temp); for ($pos=0;$pos<=$#temp;$pos++) { $arr[$seqcount][$pos] = $temp[$pos]; # print "$pos\t$arr[$seqcount][$pos]\n"; } } my $max_position = 0; $max_position = $arrlen if($arrlen > $max_position); for ($pos=0;$pos<=$max_position;$pos++) { for (my $s=1;$s<=$seqcount;$s++) { # print "$pos\t$s\t$arr[$s][$pos]\n"; if ($arr[$s][$pos] ne $arr[$seqcount][$pos]) { print "\n"; next; } else { print "$arr[$s][$pos]"; } } print "\n"; }
      Do I need to push in another array after there is a match of values? Thanks for your help.
Re: Comparing array of aligned sequences
by johngg (Canon) on Jul 11, 2014 at 16:55 UTC

    If I've understood correctly I think one approach would break your sequences down into arrays of single letters and shift and compare a letter from each at a time.

    use strict; use warnings; use 5.014; open my $inFH, q{<}, \ <<EOD or die $!; atggctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgcttggggctctgccgccctcttctccgcctgccgttccgg +c atagctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgtttggggctctgccgccctcttctccgcctgccgttcagg +c atggctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgcttggggctctgccgccctcttctccgcctgccgttccgg +c atggctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgcttggggctctaccgccctcttctccgcctgccgttccgg +c EOD chomp( my @sequences = <$inFH> ); close $inFH or die $!; my @letters = map { [ split m{} ] } @sequences; my $consensus = q{}; while ( @{ $letters[ 0 ] } ) { my $seq0 = shift @{ $letters[ 0 ] }; my $seq1 = shift @{ $letters[ 1 ] }; my $seq2 = shift @{ $letters[ 2 ] }; my $seq3 = shift @{ $letters[ 3 ] }; if ( $seq0 eq $seq1 and $seq0 eq $seq2 and $seq0 eq $seq3 ) { $consensus .= $seq0; } else { emitConsensus(); } } emitConsensus(); sub emitConsensus { say $consensus if length $consensus; $consensus = q{}; }

    The output.

    at gctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctgaat +cctgcggacgacccctctcgtggtcg ttggggctct ccgccctcttctccgcctgccgttc ggc

    I hope I have understood correctly and this is helpful.

    Cheers,

    JohnGG

      Dear JohnGG, Thank you for your reply and suggestions. The sample data which i provided only has 4 sequences whereas in reality i have files which have greater than 25,000 sequences. In that situation I guess I might have to use loops to do the checking. Kindly let me know if I am thinking in the right direction.

        Given the number of sequence lines in your files processing line by line would be the better option. Reading the first sequence into a string that we will compare with subsequent sequences one at a time, modifying the string as we go would be the approach I'd now adopt.

        The flagDiffs() subroutine takes copies of the base sequence string and the next sequence read from file as arguments and XORs them. The resultant string will have \0 (NULL) characters wherever characters in the two sequences matched. It then uses a regular expression and pos to find non-NULL characters, i.e. non-matches. Finally it modifies the base sequence string by substituting an 'X' at the positions where there were non-matches and returns it, the returned sttring being assigned back to the base sequence string. Here is a command-line example of XOR'ing two strings to demonstrate the process.

        $ perl -E ' $s1 = q{gctgc}; $s2 = q{gctcc}; $x = $s1 ^ $s2; say for map sprintf( q{0x%02x}, ord ), split m{}, $x;' 0x00 0x00 0x00 0x04 0x00 $

        Once all lines have been processed the base sequence string can be split on one or more 'X' characters to find the consensus strings.

        use strict; use warnings; use 5.014; open my $inFH, q{<}, \ <<EOD or die $!; atggctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgcttggggctctgccgccctcttctccgcctgccgttccgg +c atagctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgtttggggctctgccgccctcttctccgcctgccgttcagg +c atggctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgcttggggctctgccgccctcttctccgcctgccgttccgg +c atggctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgcttggggctctaccgccctcttctccgcctgccgttccgg +c EOD my $baseStr = <$inFH>; chomp $baseStr; while ( <$inFH> ) { chomp; $baseStr = flagDiffs( $baseStr, $_ ); } close $inFH or die $!; say for split m{X+}, $baseStr; sub flagDiffs { my( $baseStr, $nextStr ) = @_; my $diff = $baseStr ^ $nextStr; my @posns; push @posns, pos $diff while $diff =~ m{(?=[^\0])}g; substr $baseStr, $_, 1, q{X} for @posns; return $baseStr; }

        The output.

        at gctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctgaat +cctgcggacgacccctctcgtggtcg ttggggctct ccgccctcttctccgcctgccgttc ggc

        Without a 25,000 sequence file to test on I don't know whether this approach will perform but it seems to give the expected results with the sample in your OP. I hope this is helpful.

        Update: Added the command-line XOR example.

        Update 2: There's no need to keep updating the base sequence string as we go, just keep a record, as keys of a hash, of where differences are found and modify it after all lines have been processed. Also there's no need to chomp each line (as long as all the sequences are the same length), just do it at the end to the base string. If sequences differ in length then you are into pre-processing to find the shortest or longest sequence then either truncating to the shortest or padding to the longest. New code, the output is the same.

        use strict; use warnings; use 5.014; open my $inFH, q{<}, \ <<EOD or die $!; atggctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgcttggggctctgccgccctcttctccgcctgccgttccgg +c atagctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgtttggggctctgccgccctcttctccgcctgccgttcagg +c atggctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgcttggggctctgccgccctcttctccgcctgccgttccgg +c atggctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgcttggggctctaccgccctcttctccgcctgccgttccgg +c EOD my $baseStr = <$inFH>; my %diffPosns; while ( <$inFH> ) { findDiffPosns( $baseStr, $_ ); } close $inFH or die $!; chomp $baseStr; substr $baseStr, $_, 1, q{X} for keys %diffPosns; say for split m{X+}, $baseStr; sub findDiffPosns { my( $baseStr, $nextStr ) = @_; my $diff = $baseStr ^ $nextStr; $diffPosns{ pos $diff } ++ while $diff =~ m{(?=[^\0])}g; }

        Cheers,

        JohnGG

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1093245]
Approved by boftx
Front-paged by Arunbear
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others sharing their wisdom with the Monastery: (7)
As of 2024-04-23 13:00 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found