http://qs321.pair.com?node_id=11113027


in reply to Find element in array

Does this do what you want? There is no need to split the sequence into an array, as pos lets you find where in a string a match was made. Note that [^ACGT] is a negated character class, i.e. it matches anything that isn't A, C, G or T. Using capturing parentheses, ( ... ), and matching globally, m{ ... }g or / ... /g, will advance along the sequence looking for invalid letters. The positions reported are zero-based offsets into the string.

I am opening a file that is held inside the script (an in-memory filehandle on a here-document) just to keep things tidy on my system, but the code will work fine with STDIN. The code.

use 5.026;
use warnings;

open my $dnaFH, q{<}, \ <<__EOD__ or die $!;
TAAGAACAATAAGAACAAGAACAATAA
GAACAATAAGXAATAAGAAXXAACAAGAACAATAA
ACAATAAAAGAACAATAAGAA
__EOD__

while ( my $sequence = <$dnaFH> )
{
    chomp $sequence;
    my $length = length $sequence;
    say qq{Sequence: $sequence -- Length $length};
    if ( $sequence =~ m{^[ACGT]+$} )
    {
        say q{ Sequence is GOOD!};
    }
    else
    {
        my @badPosns;
        push @badPosns, pos $sequence
            while $sequence =~ m{(?x) (?= ( [^ACGT] ) )}g;
        my $nBad = scalar @badPosns;
        my $perc = sprintf q{%.2f}, $nBad / $length * 100;
        say qq{ Sequence is BAD at @badPosns};
        say qq{ $nBad bad positions, $perc\% of total};
    }
}

close $dnaFH or die $!;

The output.

Sequence: TAAGAACAATAAGAACAAGAACAATAA -- Length 27
 Sequence is GOOD!
Sequence: GAACAATAAGXAATAAGAAXXAACAAGAACAATAA -- Length 35
 Sequence is BAD at 10 19 20
 3 bad positions, 8.57% of total
Sequence: ACAATAAAAGAACAATAAGAA -- Length 21
 Sequence is GOOD!
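If you would rather read from STDIN or some other handle, the same match could be wrapped up roughly like this. It is only a sketch: the subroutine names bad_positions and scan_fh and the sample data are made up for illustration, and the in-memory handle simply stands in for whatever handle you actually pass.

```perl
use strict;
use warnings;

# Hypothetical helper: return the zero-based offsets of every
# character in the sequence that is not A, C, G or T.
sub bad_positions {
    my ($sequence) = @_;
    my @posns;
    push @posns, pos($sequence) while $sequence =~ m{(?=[^ACGT])}g;
    return @posns;
}

# Hypothetical helper: read one sequence per line from any filehandle
# and return a list of [ sequence, offset, offset, ... ] entries.
sub scan_fh {
    my ($fh) = @_;
    my @report;
    while ( my $sequence = <$fh> ) {
        chomp $sequence;
        next unless length $sequence;    # skip blank lines
        push @report, [ $sequence, bad_positions($sequence) ];
    }
    return @report;
}

# Made-up sample data; pass \*STDIN to scan_fh to read standard input.
my $data = "ACGT\nACXGTN\n";
open my $fh, '<', \$data or die $!;
for my $entry ( scan_fh($fh) ) {
    my ( $sequence, @bad ) = @$entry;
    print "$sequence: ", ( @bad ? "bad at @bad" : "good" ), "\n";
}
close $fh or die $!;
```

With the sample data this should print "ACGT: good" and "ACXGTN: bad at 2 5". Keeping the matching in its own subroutine means the reading loop does not care where the lines come from.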

I hope this is helpful. Please ask further if you need more help.

Update: There was a mistake in the code: I should have used a look-ahead assertion, as without one pos gives the position after the match, not that of the match itself. I also added extended syntax, (?x), to make the regex clearer. My bad :-(
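To illustrate the difference the look-ahead makes, here is a small stand-alone comparison on a made-up sequence (the variable names are mine). It also shows $-[0], the start offset of the last successful match, as one possible alternative to the look-ahead; that is my suggestion rather than what the code above uses.

```perl
use strict;
use warnings;

my $sequence = 'GAXXA';    # made-up example, bad letters at offsets 2 and 3

# Consuming match: pos() reports the offset just after each bad letter.
my @after;
push @after, pos($sequence) while $sequence =~ m{[^ACGT]}g;

# Look-ahead: nothing is consumed, so pos() is the offset of the bad letter.
my @at;
push @at, pos($sequence) while $sequence =~ m{(?=[^ACGT])}g;

# Alternative: $-[0] is where the last successful match started.
my @start;
push @start, $-[0] while $sequence =~ m{[^ACGT]}g;

print "after match: @after\n";    # 3 4
print "look-ahead:  @at\n";       # 2 3
print "via \$-[0]:   @start\n";   # 2 3
```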

Update 2: I should also have corrected the output, now done.

Cheers,

JohnGG