in reply to Find element in array
G'day Sofie,
Welcome to the Monastery.
"I am trying to check if an input DNA sequence only contains nucleotides."
That's a good start: you've succinctly stated your main goal.
"And if it doesn't I want to print out the position in the sequence where an invalid character was entered."
Excellent: you have a subtask, also succinctly stated.
"From title: Find element in array"
In my opinion, this is where you started to go wrong. You decided that you needed to split the entire sequence into individual characters and assign those to an array; then go back and iterate the entire array checking each individual character. DNA sequences can be exceptionally long, as you may well be aware, and doing all this extra work is completely unnecessary for your stated goals.
Here's a script that does what you want. I've had to make some guesses about the output as you didn't specify that.

    #!/usr/bin/env perl

    use strict;
    use warnings;

    my $DNA = <STDIN>;
    chomp($DNA);

    my $lengthseq = length $DNA;
    print "The length of the sequence is: $lengthseq\n";

    my (@nucleotideDNA, @nonvalid);

    for my $pos (0 .. $lengthseq - 1) {
        my $nucleotide = substr $DNA, $pos, 1;

        if ($nucleotide =~ /^[ACGT]$/) {
            push @nucleotideDNA, $pos+1 . ":\t$nucleotide";
        }
        else {
            push @nonvalid, $pos+1 . ":\t$nucleotide";
        }
    }

    print "*** nucleotideDNA ***\n";
    print "$_\n" for @nucleotideDNA;
    print "*** nonvalid ***\n";
    print "$_\n" for @nonvalid;

Here's a sample run:

    $ ./pm_11113020_parse_dna.pl
    XACGTYTGCAZ
    The length of the sequence is: 11
    *** nucleotideDNA ***
    2:      A
    3:      C
    4:      G
    5:      T
    7:      T
    8:      G
    9:      C
    10:     A
    *** nonvalid ***
    1:      X
    6:      Y
    11:     Z

You may have noticed that I've structured my code in a similar way to yours. Let's look at the differences.
- The shebang line, #!... on line one, can be written in various ways; you can read more about that in "perlrun". You'll note that I do not have the "-w" command switch at the end and I recommend that you don't use it either: see "perlrun: Command Switches: -w" for more about that.
- Next you'll see I've used the "strict" and "warnings" pragmata. You should put those two lines at the top of all your code. See "perlintro: Safety net" for more about that.
- The next couple of lines are almost identical except that I've used "my" to declare the $DNA variable. If you look down the code, you'll see I've declared all variables the same way. See "perlintro: Perl variable types" for more.
- I've then skipped creation of the @DNA array, as already discussed; got the length using the "length" function; then printed the result. Note how I've interpolated $lengthseq into the print string.
- Next, I've declared two array variables in one statement. There's no need to initialise an array, although some people like to do that; if you do, use an empty list '()', not a zero-length string '""'.
- Instead of looping through all of the elements of an array, I loop through a range of numbers using the range operator, '..'. See "perlop: Range Operators" for more on that.
- I access each (potential) nucleotide using "substr".
- My regex is almost identical to yours except I've omitted the '+': that indicates matching one or more characters and, in each iteration, there's only one character. Use of anchors, '^' and '$' in this case, is good: they state that the whole string must match and, as a general rule, anchoring lets a regex fail sooner, which makes it more efficient.
- I've pushed nucleotides onto arrays as you did. I also included the sequence position — note that's one more ($pos+1) than the string position ($pos). As already stated, I made some guesses here because you didn't say exactly what you wanted.
- Lastly, a series of print statements just shows the results. You'll probably want something different here.
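If all you need is the positions of the invalid characters, the same loop can be wrapped in a subroutine. This is only a sketch: the name invalid_positions and the wording of the messages are my own inventions for illustration, and the sample sequence is hard-coded rather than read from STDIN.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Hypothetical helper (the name is an assumption, not from the script above).
# Returns a list of [position, character] pairs, one for every character
# that is not A, C, G or T. Positions are 1-based: string offset + 1.
sub invalid_positions {
    my ($dna) = @_;

    my @invalid;

    for my $pos (0 .. length($dna) - 1) {
        my $char = substr $dna, $pos, 1;
        push @invalid, [$pos + 1, $char] unless $char =~ /^[ACGT]$/;
    }

    return @invalid;
}

# Same sample input as the run shown earlier.
my @bad = invalid_positions('XACGTYTGCAZ');

if (@bad) {
    print "Invalid character '$_->[1]' at position $_->[0]\n" for @bad;
}
else {
    print "Sequence contains only nucleotides\n";
}
```

With that input it reports X at position 1, Y at position 6 and Z at position 11, which matches the "nonvalid" section of the sample run. An empty return list means the sequence passed the check.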
"... I am very new to perl ..."
That's fine, we all started knowing nothing about Perl. Note that Perl is the language and perl is the program.
I recommend you read through "perlintro" and bookmark that page. There's no need to try and learn it all in one sitting; just get a general feel for what it has to offer. It is peppered with links to FAQs, tutorials and more detailed information. Refer back to it whenever the need arises.
Finally, in case you had some genuine, but unstated, reason to use an array, you could have iterated it like this:
for my $pos (0 .. $#DNA) { ... }
Then accessed each element with $DNA[$pos] and reported the position with $pos+1 as I did.
Using the range operator (..) is a standard way to do this: see "perlop: Range Operators" for details.
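To make that concrete, here's a minimal sketch of the array version. It assumes @DNA holds one character per element; I've built it with split from the earlier sample input purely for illustration, and it only collects the invalid entries.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Assumed sample data: the sequence already split into single characters.
my @DNA = split //, 'XACGTYTGCAZ';

my @nonvalid;

# $#DNA is the index of the last element, so 0 .. $#DNA covers every index.
for my $pos (0 .. $#DNA) {
    next if $DNA[$pos] =~ /^[ACGT]$/;

    # Report the sequence position ($pos+1), not the array index ($pos).
    push @nonvalid, $pos+1 . ":\t$DNA[$pos]";
}

print "$_\n" for @nonvalid;
```

The loop body is the same as in the substr version; the only changes are the upper bound ($#DNA instead of $lengthseq - 1) and the element access ($DNA[$pos] instead of substr).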
I don't think that's what you wanted, or needed, here; but at least you now know how to do it when a more appropriate scenario comes along.
— Ken