http://qs321.pair.com?node_id=962403


in reply to find index of specific array value that occurs multiple times

Your problem is complicated by the use of spaces in the consensus lines. If we simply split on whitespace, then we would get an incorrect index/offset.

Working, tested code:

use strict; use warnings; push @ARGV, 'pm_962355_01.dat' if not @ARGV; my @consensus_lines; my $seq_start_column = 0; while (<>) { chomp; die "This algorithm relies on spaces - no tabs allowed!" if /\t/; if ( !$seq_start_column and /^(\d+\.\S+\s+)\S/ ) { $seq_start_column = length $1; } next if /^CLUSTAL/; # Header row next if !/\S/; # Blank rows if ( /\*/ ) { push @consensus_lines, $_; } } if (!$seq_start_column) { die "Failed to calculate start column for sequences"; } my $consensus = join '', map { substr $_, $seq_start_column } @consensus_lines; # Just for debugging use Data::Dumper; $Data::Dumper::Useqq = 1; print Dumper $consensus; my @indexes = 0..length($consensus); my @index_c = grep { substr($consensus,$_,1) eq '*' } @indexes; my @index_n = grep { substr($consensus,$_,1) ne '*' } @indexes; # If positions are 0-based: print " Conserved: ", join(',', @index_c), "\n"; print "Not conserved: ", join(',', @index_n), "\n"; # If positions are 1-based: #my @offset_c = map { $_ + 1 } @index_c; #my @offset_n = map { $_ + 1 } @index_n; #print " Conserved: ", join(',', @offset_c), "\n"; #print "Not conserved: ", join(',', @offset_n), "\n";
Output:
$VAR1 = " :*** * .** *::***** **:::::**: ::*:*** :*****:*.** +*******: .:* * **:.*.. *****.***:***:: ..*: **.* ****:************* +"; Conserved: 9,10,11,14,19,20,22,25,26,27,28,29,31,32,38,39,44,46,47 +,48,51,52,53,54,55,57,59,60,61,62,63,64,65,66,67,72,74,78,79,82,86,87 +,88,89,90,92,93,94,96,97,98,104,107,108,110,112,113,114,115,117,118,1 +19,120,121,122,123,124,125,126,127,128,129 Not conserved: 0,1,2,3,4,5,6,7,8,12,13,15,16,17,18,21,23,24,30,33,34,3 +5,36,37,40,41,42,43,45,49,50,56,58,68,69,70,71,73,75,76,77,80,81,83,8 +4,85,91,95,99,100,101,102,103,105,106,109,111,116,130

Replies are listed 'Best First'.
Re^2: find index of specific array value that occurs multiple times
by AWallBuilder (Beadle) on Mar 29, 2012 at 14:31 UTC

    Thank you. As you said this works. Can you please explain a few lines to me.

    my @indexes = 0..length($consensus); my @index_c = grep { substr($consensus,$_,1) eq '*' } @indexes; my @index_n = grep { substr($consensus,$_,1) ne '*' } @indexes;

      There's a bug... an off by one error on the last iteration: it should read length( $consensus ) - 1;

      Actually they could be reduced to a single loop like this:

      use strict; use warnings; my $consensus = 'abc*def*ghi'; my( @index_c, @index_n ); my $idx = 0; my $len = length( $consensus ); push( @{ substr( $consensus, $idx, 1 ) eq '*' ? \@index_c : \@index_n +}, $idx++ ) while $idx < $len; print "C: @index_c\n"; print "N: @index_n\n";

      This trades three loops (two explicit -- grep, and one implicit -- the initialization of @indexes) for a single while loop. It also reduces the calls to substr in half.

      Explaining the original (grep) method: Create an array that contains the indices of each position within the string $consensus. Then iterate over the array checking the character in $consensus at each position for the existence of the character '*'. If the condition is true ('*' exists in that position), place that index position in @index_c. Repeat the process testing for non-existence of '*', placing all indices where '*' doesn't exist into @index_n.

      Explaining my refactor: Create some empty index holders (@index_c, @index_n). Create a counter variable and a variable that holds the length of $consensus ( $idx, $len ). Then we take the length of $consensus just one time and store it (rather than call length in the while loop conditional). Next iterate over the $idx values from 0 through the last index position of $consensus ($idx++ < $len). At each position use substr to access and test whether the character at position $idx contains a '*'. The ternary operator exposes either a reference to @index_c or a reference to @index_n to the @{...} dereference construct, creating an lvalue. We then push $idx (the current position) onto either @index_c or @index_n depending on the outcome of the conditional test. Repeat this process for each character position from 0 through $len-1.

      That changes the algorithm efficiency from three O(n) loops to one O(n) loop. However, it does sacrifice coding clarity, which could be restored like this:

      my $consensus = 'abc*def*ghi'; my( @index_c, @index_n ); my $idx = 0; my $len = length( $consensus ); while( $idx < $len ) { if( substr( $consensus, $idx, 1 ) eq '*' ) { push @index_c, $idx; } else { push @index_n, $idx; } $idx++; } print "C: @index_c\n"; print "N: @index_n\n";

      Dave