note
kcott
<p>G'day [id://11113018|Sofie],</p>
<p>Welcome to the Monastery.</p>
<blockquote>
<em>"I am trying to check if an input DNA sequence only contains nucleotides."</em>
</blockquote>
<p>
That's a good start: you've succinctly stated your main goal.
</p>
<blockquote>
<em>"And if it doesn't I want to print out the position in the sequence where an invalid character was entered."</em>
</blockquote>
<p>
Excellent: you have a subtask; also succinctly stated.
</p>
<blockquote>
<em>"From title: Find element in array"</em>
</blockquote>
<p>
In my opinion, this is where you started to go wrong.
You decided that you needed to split the entire sequence into individual characters and assign those to an array;
then go back and iterate the entire array checking each individual character.
DNA sequences can be exceptionally long — you may be well aware of this —
and doing all this extra work is completely unnecessary for your stated goals.
</p>
<p>
Here's a script that does what you want.
I've had to make some guesses about the output as you didn't specify that.
</p>
<code>
#!/usr/bin/env perl
use strict;
use warnings;
my $DNA = <STDIN>;
chomp($DNA);
my $lengthseq = length $DNA;
print "The length of the sequence is: $lengthseq\n";
my (@nucleotideDNA, @nonvalid);
for my $pos (0 .. $lengthseq - 1) {
my $nucleotide = substr $DNA, $pos, 1;
if ($nucleotide =~ /^[ACGT]$/) {
push @nucleotideDNA, $pos+1 . ":\t$nucleotide";
}
else {
push @nonvalid, $pos+1 . ":\t$nucleotide";
}
}
print "*** nucleotideDNA ***\n";
print "$_\n" for @nucleotideDNA;
print "*** nonvalid ***\n";
print "$_\n" for @nonvalid;
</code>
<p>Here's a sample run:</p>
<code>
$ ./pm_11113020_parse_dna.pl
XACGTYTGCAZ
The length of the sequence is: 11
*** nucleotideDNA ***
2: A
3: C
4: G
5: T
7: T
8: G
9: C
10: A
*** nonvalid ***
1: X
6: Y
11: Z
</code>
<p>
You may have noticed that I've structured my code in a similar way to yours. Let's look at the differences.
</p>
<ul>
<li>
The shebang line, <c>#!...</c> on line one, can be written in various ways;
you can read more about that in "[https://perldoc.perl.org/perlrun.html|perlrun]".
You'll note that I do not have the "<c>-w</c>" command switch at the end and I recommend that you don't use it either:
see "[https://perldoc.perl.org/perlrun.html#*-w*|perlrun: Command Switches: -w]" for more about that.
</li>
<li>
Next you'll see I've used the "[https://perldoc.perl.org/5.30.0/strict.html|strict]" and
"[https://perldoc.perl.org/5.30.0/warnings.html|warnings]" pragmata.
You should put those two lines at the top of all your code.
See "[https://perldoc.perl.org/perlintro.html#Safety-net|perlintro: Safety net]" for more about that.
</li>
<li>
The next couple of lines are almost identical except that I've used
"[https://perldoc.perl.org/5.30.0/functions/my.html|my]" to declare the <c>$DNA</c> variable.
If you look down the code, you'll see I've declared all variables the same way.
See "[https://perldoc.perl.org/perlintro.html#Perl-variable-types|perlintro: Perl variable types]" for more.
</li>
<li>
I've then skipped creation of the <c>@DNA</c> array, as already discussed; got the length using the
"[https://perldoc.perl.org/5.30.0/functions/length.html|length]" function; then printed the result.
Note how I've <em>interpolated</em> <c>$lengthseq</c> into the print string.
</li>
<li>
Next, I've declared two array variables in one statement.
There's no need to initialise an array; although, some people like to do that —
if you do, use an empty list '<c>()</c>', not a zero-length string '<c>""</c>'.
</li>
<li>
Instead of looping through all of the elements of an array, I loop through a range of numbers
using the range operator, '<c>..</c>'.
See "[https://perldoc.perl.org/perlop.html#Range-Operators|perlop: Range Operators]" for more on that.
</li>
<li>
I access each (potential) nucleotide using "[https://perldoc.perl.org/5.30.0/functions/substr.html|substr]".
</li>
<li>
My regex is almost identical to yours except I've omitted the '<c>+</c>':
that indicates matching one or more characters and, in each iteration, there's only one character.
Use of anchors, '<c>^</c>' and '<c>$</c>' in this case, is good;
as a general rule, it will make your regexes more efficient.
</li>
<li>
I've pushed nucleotides onto arrays as you did.
I also included the sequence position — note that's one more (<c>$pos+1</c>) than the string position (<c>$pos</c>).
As already stated, I made some guesses here because you didn't say exactly what you wanted.
</li>
<li>
Lastly, a series of print statements just shows the results.
You'll probably want something different here.
</li>
</ul>
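<p>
If only the invalid positions matter, the per-character loop could be condensed into a single global regex match over the whole string.
The following is only a sketch under some assumptions: the helper name <c>invalid_positions</c> is made up for illustration,
input is assumed to be uppercase, and positions are reported 1-based in the same "position, tab, character" format guessed at above.
</p>

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): returns one
# "position:\tcharacter" string for every character other than A, C, G or T.
sub invalid_positions {
    my ($dna) = @_;
    my @nonvalid;
    # In scalar context, a /g match resumes where the previous match ended.
    # pos() holds the offset just past the match; for a one-character match
    # that is the same number as the character's 1-based position.
    while ($dna =~ /([^ACGT])/g) {
        push @nonvalid, pos($dna) . ":\t$1";
    }
    return @nonvalid;
}

print "$_\n" for invalid_positions('XACGTYTGCAZ');
```

<p>
This avoids both the array and the explicit index arithmetic; whether it reads more clearly than the <c>substr</c> loop is a matter of taste.
</p>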
<blockquote>
<em>"... I am very new to perl ..."</em>
</blockquote>
<p>
That's fine, we all started knowing nothing about Perl.
Note that <em>Perl</em> is the language and <c>perl</c> is the program.
</p>
<p>
I recommend you read through "[https://perldoc.perl.org/perlintro.html|perlintro]" and bookmark that page.
There's no need to try to learn it all in one sitting; just get a general feel for what it has to offer.
It is peppered with links to FAQs, tutorials and more detailed information.
Refer back to it whenever the need arises.
</p>
<small>
<p>
Finally, in case you had some genuine, but <em>unstated</em>, reason to use an array, you could have iterated it like this:
</p>
<code>
for my $pos (0 .. $#DNA) { ... }
</code>
<p>
Then accessed each element with <c>$DNA[$pos]</c> and reported the position with <c>$pos+1</c> as I did.
</p>
<p>
Using the range operator (<c>..</c>) is a standard way to do this:
see "[https://perldoc.perl.org/perlop.html#Range-Operators|perlop: Range Operators]" for details.
</p>
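<p>
Put together, an array-based version might look something like the sketch below.
The hard-coded <c>$sequence</c> is an assumption made purely so the example runs on its own, and only the invalid characters are collected to keep it short.
</p>

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Assumed sample input for illustration; a real script would read from STDIN.
my $sequence = 'XACGTYTGCAZ';

# split with an empty pattern yields one array element per character.
my @DNA = split //, $sequence;

my @nonvalid = ();    # optional: an empty list, never ""

# $#DNA is the last index of @DNA, so the range covers every element.
for my $pos (0 .. $#DNA) {
    next if $DNA[$pos] =~ /^[ACGT]$/;
    # Report the 1-based sequence position, not the 0-based array index.
    push @nonvalid, ($pos + 1) . ":\t$DNA[$pos]";
}

print "$_\n" for @nonvalid;
```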
<p>
I don't think that's what you wanted, or needed, here.
At least you now know how to do this when a more appropriate scenario comes along.
</p>
</small>
<!-- Node text goes above. Div tags should contain sig only -->
<div class="pmsig"><div class="pmsig-861371">
<p>— Ken</p>
</div></div>
11113020
11113020