Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

find index of specific array value that occurs multiple times

by AWallBuilder (Beadle)
on Mar 29, 2012 at 11:54 UTC ( [id://962355]=perlquestion: print w/replies, xml ) Need Help??

AWallBuilder has asked for the wisdom of the Perl Monks concerning the following question:

This is related to a similar post yesterday about alignment data. Unfortunately the alignment data is in a different format so that script will not work, buuut fortunately the consensus (ie. if a letter is the same in all columns) is shown at the bottom by a series of stars and periods. I have a few problems: - the alignment is broken over 3 lines by line breaks, so my consensus appears to be 3 words - I don't know how to read out all of the index positions for which an array element matches something eg. * (I have read a similar post but all of the examples are for finding the index of a match that only occurs once.) below is an example data file and my script.

CLUSTAL O(1.0.3) multiple sequence alignment 435590.150003364 MAKKTV-AAKKRNVKVDANGQLHVHSSFNNIIVSLANS +EGQIISWSSAGKMGFRGSKKNT 226186.29348112 MAKKTV-AAKKRNVKVDANGQLHVHSSFNNIIVSLANS +EGQIISWSSAGKMGFRGSKKNT 295405.53715441 MAKKTV-AAKKRNVKVDANGQLHVHSSFNNIIVSLANS +EGQIISWSSAGKMGFRGSKKNT 272559.60683413 MAKKTV-AAKKRNVKVDANGQLHVHSSFNNIIVSLANS +EGQIISWSSAGKMGFRGSKKNT 411901.149130658 MAKKTV-AAKKRNVKVDANGQLHVHSSFNNIIVSLANS +EGQIISWSSAGKMGFRGSKKNT 411476.156111512 MAKKTV-AAKKRNVKVDANGQLHVHSSFNNIIVSLANS +EGQIISWSSAGKMGFRGSKKNT 411479.156862051 MAKKTV-AAKKRNVKVDANGQLHVHSSFNNIIVSLANS +EGQIISWSSAGKMGFRGSKKNT 449673.167697753 MAKKTV-AAKKRNVKVDANGQLHVHSSFNNIIVSLANS +EGQIISWSSAGKMGFRGSKKNT 470145.189432667 MAKKTV-AAKKRNVKVDANGQLHVHSSFNNIIVSLANS +EGQIISWSSAGKMGFRGSKKNT 483215.260624635 MAKKTV-AAKKRNVKVDANGQLHVHSSFNNIIVSLANS +EGQIISWSSAGKMGFRGSKKNT 483218.217990640 -------MTKKRVKKNVEHGQAHIQSSFNNTIVTLTDA +EGNALSWASAGGLGFRGSKKST 483216.217986519 MAKKTV-AAKKRNVKVDANGQLHVHSSFNNIIVSLANS +EGQIISWSSAGKMGFRGSKKNT 484018.198273057 MAKKTV-AAKKRNVKVDANGQLHVHSSFNNIIVSLANS +EGQIIAWSSAGKMGFRGSKKNT 483217.212662191 MAKKTV-AAKKRNVKVDANGQLHVHSSFNNIIVSLANS +EGQIISWSSAGKMGFRGSKKNT 511680.292809380 MAKVTKKVTKKRVKKNVERGQAHIQSSFNNTIVTITDT +EGNALSWASAGGLGFRGSRKST 537012.224520527 MAKKTV-AAKKRNVKVDANGQLHVHSSFNNIIVSLANS +EGQIISWSSAGKMGFRGSKKNT 547042.224016865 MAKKTV-AAKKRNVKVDANGQLHVHSSFNNIIVSLANS +EGQIISWSSAGKMGFRGSKKNT 469586.251841635 MAKKTV-AAKKRNVKVDANGQLHVHSSFNNIIVSLANS +EGQIISWSSAGKMGFRGSKKNT 469590.229451850 MAKKTV-AAKKRNVKVDANGQLHVHSSFNNIIVSLANS +EGQIISWSSAGKMGFRGSKKNT 457392.251944867 MAKKTV-AAKKRNVKVDANGQLHVHSSFNNIIVSLANS +EGQIISWSSAGKMGFRGSKKNT 457394.254834140 MAKKTV-AAKKRNVKVDANGQLHVHSSFNNIIVSLANS +EGQIISWSSAGKMGFRGSKKNT 457395.229454394 MAKKTV-AAKKRNVKVDANGQLHVHSSFNNIIVSLANS +EGQIISWSSAGKMGFRGSKKNT 556258.229443407 MAKKTV-AAKKRNVKVDANGQLHVHSSFNNIIVSLANS +EGQIISWSSAGKMGFRGSKKNT 556260.229435668 MAKKTV-AAKKRNVKVDANGQLHVHSSFNNIIVSLANS +EGQIISWSSAGKMGFRGSKKNT 469587.263252878 MAKKTV-AAKKRNVKVDANGQLHVHSSFNNIIVSLANS +EGQIISWSSAGKMGFRGSKKNT 469588.262357064 MAKKTV-AAKKRNVKVDANGQLHVHSSFNNIIVSLANS +EGQIISWSSAGKMGFRGSKKNT 457391.263236304 MAKKTV-AAKKRNVKVDANGQLHVHSSFNNIIVSLANS +EGQIISWSSAGKMGFRGSKKNT 585543.270273974 MAKKTV-AAKKRNVKVDANGQLHVHSSFNNIIVSLANS +EGQIISWSSAGKMGFRGSKKNT 702446.294448298 MAKKTV-AAKKRNVKVDANGQLHVHSSFNNIIVSLANS +EGQIISWSSAGKMGFRGSKKNT 702443.292632887 MAKKTV-AAKKRNVKVDANGQLHVHSSFNNIIVSLANS +EGQIISWSSAGKMGFRGSKKNT 702444.292640606 MAKKTV-AAKKRNVKVDANGQLHVHSSFNNIIVSLANS +EGQIISWSSAGKMGFRGSKKNT 702447.294446237 MAKKTV-AAKKRNVKVDANGQLHVHSSFNNIIVSLANS +EGQIISWSSAGKMGFRGSKKNT 657309.locus_tag:BXY_18280 MAKKTV-AAKKRNVKVDANGQLHVHSSFNNIIVSLANS +EGQIISWSSAGKMGFRGSKKNT 457424.EQ973217.G291 MAKKTV-AAKKRNVKVDANGQLHVHSSFNNIIVSLANS +EGQIISWSSAGKMGFRGSKKNT 556259.GG663458.G80 MAKKTV-AAKKRNVKVDANGQLHVHSSFNNIIVSLANS +EGQIISWSSAGKMGFRGSKKNT 471870.DS562360.G288 MAKKTV-AAKKRNVKVDANGQLHVHSSFNNIIVSLANS +EGQIISWSSAGKMGFRGSKKNT :*** * .** *::***** **::::: +**: ::*:*** :*****:*.* 435590.150003364 PYAAQMAAQDCAKVAFDLGLRKVKAYVKGPGNGRESAI +RTVHGAGIEVTEIIDVTPLPHN 226186.29348112 PYAAQMAAQDCAKIAFDLGLRKVKAYVKGPGNGRESAI +RTIHGAGIEVTEIIDVTPLPHN 295405.53715441 PYAAQMAAQDCAKIAFDLGLRKVKAYVKGPGNGRESAI +RTIHGAGIEVTEIIDVTPLPHN 272559.60683413 PYAAQMAAQDCAKIAFDLGLRKVKAYVKGPGNGRESAI +RTIHGAGIEVTEIIDVTPLPHN 411901.149130658 PYAAQMAAQDCAKVAFDLGLRKVKAYVKGPGNGRESAI +RTIHGAGIEVTEIIDVTPLPHN 411476.156111512 PYAAQMAAQDCAKIAFDLGLRKVKAYVKGPGNGRESAI +RTIHGAGIEVTEIIDVTPLPHN 411479.156862051 PYAAQMAAQDCAKIAYDLGLRKVKAYVKGPGNGRESAI +RTIHGAGIEVTEIIDVTPLPHN 449673.167697753 PYAAQMAAQDCAKVAYDLGLRKVKAYVKGPGNGRESAI +RTIHGAGIEVTEIIDVTPLPHN 470145.189432667 PYAAQMAAQDCAKVAYDLGLRKVKAYVKGPGNGRESAI +RTVHGAGIEVTEIIDVTPLPHN 483215.260624635 PYAAQMAAQDCAKIAFDLGLRKVKAYVKGPGNGRESAI +RTIHGAGIEVTEIIDVTPLPHN 483218.217990640 PYAAQMAAETATKAALIHGLKSVDVMVKGPGSGREAAI +RALSAAGLTVTSIKDVTPVPHN 483216.217986519 PYAAQMAAQDCAKIAYDLGLRKVKAYVKGPGNGRESAI +RTIHGAGIEVTEIIDVTPLPHN 484018.198273057 PYAAQMAAQDCAKVAYDLGLRKVKAYVKGPGNGRESAI +RTVHGAGIEVTEIIDVTPLPHN 483217.212662191 PYAAQMAAQDCAKVAFDLGLRKVKAYVKGPGNGRESAI +RTVHGAGIEVTEIIDVTPLPHN 511680.292809380 PYAAQMAAETATKAALIHGLKSVDVMVKGPGSGREAAI +RALQACGLEVTSIKDVTPVPHN 537012.224520527 PYAAQMAAQDCAKIAFDLGLRKVKAYVKGPGNGRESAI +RTIHGAGIEVTEIVDVTPLPHN 547042.224016865 PYAAQMAAQDCAKVAYDLGLRKVKAYVKGPGNGRESAI +RTVHGAGIEVTEIIDVTPLPHN 469586.251841635 PYAAQMAAQDCAKIAFDLGLRKVKAYVKGPGNGRESAI +RTIHGAGIEVTEIIDVTPLPHN 469590.229451850 PYAAQMAAQDCAKIAFDLGLRKVKAYVKGPGNGRESAI +RTIHGAGIEVTEIIDVTPLPHN 457392.251944867 PYAAQMAAQDCAKIAFDLGLRKVKAYVKGPGNGRESAI +RTIHGAGIEVTEIIDVTPLPHN 457394.254834140 PYAAQMAAQDCAKVAFDLGLRKVKAYVKGPGNGRESAI +RTVHGAGIEVTEIIDVTPLPHN 457395.229454394 PYAAQMAAQDCAKVAFDLGLRKVKAYVKGPGNGRESAI +RTVHGAGIEVTEIIDVTPLPHN 556258.229443407 PYAAQMAAQDCAKIAFDLGLRKVKAYVKGPGNGRESAI +RTIHGAGIEVTEIIDVTPLPHN 556260.229435668 PYAAQMAAQDCAKVAFDLGLRKVKAYVKGPGNGRESAI +RTVHGAGIEVTEIIDVTPLPHN 469587.263252878 PYAAQMAAQDCAKIAFDLGLRKVKAYVKGPGNGRESAI +RTIHGAGIEVTEIIDVTPLPHN 469588.262357064 PYAAQMAAQDCAKIAFDLGLRKVKAYVKGPGNGRESAI +RTIHGAGIEVTEIIDVTPLPHN 457391.263236304 PYAAQMAAQDCAKVAFDLGLRKVKAYVKGPGNGRESAI +RTVHGAGIEVTEIIDVTPLPHN 585543.270273974 PYAAQMAAQDCAKIAYDLGLRKVKAYVKGPGNGRESAI +RTIHGAGIEVTEIIDVTPLPHN 702446.294448298 PYAAQMAAQDCAKVAFDLGLRKVKAYVKGPGNGRESAI +RTVHGAGIEVTEIIDVTPLPHN 702443.292632887 PYAAQMAAQDCAKIAFDLGLRKVKAYVKGPGNGRESAI +RTIHGAGIEVTEIIDVTPLPHN 702444.292640606 PYAAQMAAQDCAKIAFDLGLRKVKAYVKGPGNGRESAI +RTIHGAGIEVTEIIDVTPLPHN 702447.294446237 PYAAQMAAQDCAKIAFDLGLRKVKAYVKGPGNGRESAI +RTIHGAGIEVTEIIDVTPLPHN 657309.locus_tag:BXY_18280 PYAAQMAAQDCAKIAFDLGLRKVKAYVKGPGNGRESAI +RTIHGAGIEVTEIIDVTPLPHN 457424.EQ973217.G291 PYAAQMAAQDCAKIAFDLGLRKVKAYVKGPGNGRESAI +RTIHGAGIEVTEIIDVTPLPHN 556259.GG663458.G80 PYAAQMAAQDCAKIAFDLGLRKVKAYVKGPGNGRESAI +RTIHGAGIEVTEIIDVTPLPHN 471870.DS562360.G288 PYAAQMAAQDCAKVAFDLGLRKVKAYVKGPGNGRESAI +RTIHGAGIEVTEIIDVTPLPHN ********: .:* * **:.*.. *****.***:** +*:: ..*: **.* ****:*** 435590.150003364 GCRPPKRRRV 226186.29348112 GCRPPKRRRV 295405.53715441 GCRPPKRRRV 272559.60683413 GCRPPKRRRV 411901.149130658 GCRPPKRRRV 411476.156111512 GCRPPKRRRV 411479.156862051 GCRPPKRRRV 449673.167697753 GCRPPKRRRV 470145.189432667 GCRPPKRRRV 483215.260624635 GCRPPKRRRV 483218.217990640 GCRPPKRRRV 483216.217986519 GCRPPKRRRV 484018.198273057 GCRPPKRRRV 483217.212662191 GCRPPKRRRV 511680.292809380 GCRPPKRRRV 537012.224520527 GCRPPKRRRV 547042.224016865 GCRPPKRRRV 469586.251841635 GCRPPKRRRV 469590.229451850 GCRPPKRRRV 457392.251944867 GCRPPKRRRV 457394.254834140 GCRPPKRRRV 457395.229454394 GCRPPKRRRV 556258.229443407 GCRPPKRRRV 556260.229435668 GCRPPKRRRV 469587.263252878 GCRPPKRRRV 469588.262357064 GCRPPKRRRV 457391.263236304 GCRPPKRRRV 585543.270273974 GCRPPKRRRV 702446.294448298 GCRPPKRRRV 702443.292632887 GCRPPKRRRV 702444.292640606 GCRPPKRRRV 702447.294446237 GCRPPKRRRV 657309.locus_tag:BXY_18280 GCRPPKRRRV 457424.EQ973217.G291 GCRPPKRRRV 556259.GG663458.G80 GCRPPKRRRV 471870.DS562360.G288 GCRPPKRRRV **********
use strict; use warnings; my $clu_align=$ARGV[0]; open(IN,">",$clu_align) or die "can't open file $clu_align"; my @consensus; while(my $line=<IN>){ chomp($line); next if ($line =~ /^CLUSTAL/); # header row next if ($line=~ /^$/); # a few blank rows if ($line =~ /\*/){ push (@consensus,$line); } } close(IN); chomp(@consensus); #foreach my $c (@consensus){ just for debugging # print "consensus\n"; # print "$c\t"; # print "\n"; #} my( @index )= grep { $consensus[$_] eq "*" } 0..$#consensus; my $outfile="$clu_align.VarPositions"; open (OUT,">",$outfile); foreach my $i (@index) { print OUT "$i\tis not conserved\n"; }

Replies are listed 'Best First'.
Re: find index of specific array value that occurs multiple times
by Util (Priest) on Mar 29, 2012 at 12:41 UTC

    open(IN,">",$clu_align)
    You are overwriting your input file!
    Change > to <

      oops - thanks - still not working output is that all is conserved
Re: find index of specific array value that occurs multiple times
by Util (Priest) on Mar 29, 2012 at 13:46 UTC

    Your problem is complicated by the use of spaces in the consensus lines. If we simply split on whitespace, then we would get an incorrect index/offset.

    Working, tested code:

    use strict; use warnings; push @ARGV, 'pm_962355_01.dat' if not @ARGV; my @consensus_lines; my $seq_start_column = 0; while (<>) { chomp; die "This algorithm relies on spaces - no tabs allowed!" if /\t/; if ( !$seq_start_column and /^(\d+\.\S+\s+)\S/ ) { $seq_start_column = length $1; } next if /^CLUSTAL/; # Header row next if !/\S/; # Blank rows if ( /\*/ ) { push @consensus_lines, $_; } } if (!$seq_start_column) { die "Failed to calculate start column for sequences"; } my $consensus = join '', map { substr $_, $seq_start_column } @consensus_lines; # Just for debugging use Data::Dumper; $Data::Dumper::Useqq = 1; print Dumper $consensus; my @indexes = 0..length($consensus); my @index_c = grep { substr($consensus,$_,1) eq '*' } @indexes; my @index_n = grep { substr($consensus,$_,1) ne '*' } @indexes; # If positions are 0-based: print " Conserved: ", join(',', @index_c), "\n"; print "Not conserved: ", join(',', @index_n), "\n"; # If positions are 1-based: #my @offset_c = map { $_ + 1 } @index_c; #my @offset_n = map { $_ + 1 } @index_n; #print " Conserved: ", join(',', @offset_c), "\n"; #print "Not conserved: ", join(',', @offset_n), "\n";
    Output:
    $VAR1 = " :*** * .** *::***** **:::::**: ::*:*** :*****:*.** +*******: .:* * **:.*.. *****.***:***:: ..*: **.* ****:************* +"; Conserved: 9,10,11,14,19,20,22,25,26,27,28,29,31,32,38,39,44,46,47 +,48,51,52,53,54,55,57,59,60,61,62,63,64,65,66,67,72,74,78,79,82,86,87 +,88,89,90,92,93,94,96,97,98,104,107,108,110,112,113,114,115,117,118,1 +19,120,121,122,123,124,125,126,127,128,129 Not conserved: 0,1,2,3,4,5,6,7,8,12,13,15,16,17,18,21,23,24,30,33,34,3 +5,36,37,40,41,42,43,45,49,50,56,58,68,69,70,71,73,75,76,77,80,81,83,8 +4,85,91,95,99,100,101,102,103,105,106,109,111,116,130

      Thank you. As you said this works. Can you please explain a few lines to me.

      my @indexes = 0..length($consensus); my @index_c = grep { substr($consensus,$_,1) eq '*' } @indexes; my @index_n = grep { substr($consensus,$_,1) ne '*' } @indexes;

        There's a bug... an off by one error on the last iteration: it should read length( $consensus ) - 1;

        Actually they could be reduced to a single loop like this:

        use strict; use warnings; my $consensus = 'abc*def*ghi'; my( @index_c, @index_n ); my $idx = 0; my $len = length( $consensus ); push( @{ substr( $consensus, $idx, 1 ) eq '*' ? \@index_c : \@index_n +}, $idx++ ) while $idx < $len; print "C: @index_c\n"; print "N: @index_n\n";

        This trades three loops (two explicit -- grep, and one implicit -- the initialization of @indexes) for a single while loop. It also reduces the calls to substr in half.

        Explaining the original (grep) method: Create an array that contains the indices of each position within the string $consensus. Then iterate over the array checking the character in $consensus at each position for the existence of the character '*'. If the condition is true ('*' exists in that position), place that index position in @index_c. Repeat the process testing for non-existence of '*', placing all indices where '*' doesn't exist into @index_n.

        Explaining my refactor: Create some empty index holders (@index_c, @index_n). Create a counter variable and a variable that holds the length of $consensus ( $idx, $len ). Then we take the length of $consensus just one time and store it (rather than call length in the while loop conditional). Next iterate over the $idx values from 0 through the last index position of $consensus ($idx++ < $len). At each position use substr to access and test whether the character at position $idx contains a '*'. The ternary operator exposes either a reference to @index_c or a reference to @index_n to the @{...} dereference construct, creating an lvalue. We then push $idx (the current position) onto either @index_c or @index_n depending on the outcome of the conditional test. Repeat this process for each character position from 0 through $len-1.

        That changes the algorithm efficiency from three O(n) loops to one O(n) loop. However, it does sacrifice coding clarity, which could be restored like this:

        my $consensus = 'abc*def*ghi'; my( @index_c, @index_n ); my $idx = 0; my $len = length( $consensus ); while( $idx < $len ) { if( substr( $consensus, $idx, 1 ) eq '*' ) { push @index_c, $idx; } else { push @index_n, $idx; } $idx++; } print "C: @index_c\n"; print "N: @index_n\n";

        Dave

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://962355]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others studying the Monastery: (8)
As of 2024-04-18 10:41 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found