http://qs321.pair.com?node_id=1205271

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am really new to Perl and am taking a course on it. I wrote the following program for an assignment, but it produces incorrect output: I get over a million lines, while the expected output is closer to 250,000. The last 12 nt (nucleotides) of each 21-mer need to be unique in the genome. I have a feeling the problem is my regex. Any advice would be greatly appreciated. Thank you.

#!/usr/bin/perl
use strict;
use warnings;

my %windowSeqScore = ();

my $input_file  = '/scratch/Drosophila/dmel-all-chromosome-r6.02.fasta';
my $sequenceRef = loadSequence($input_file);

my $output_file = 'unique12KmersEndingGG.fasta';
open (KMERS, ">", $output_file) or die $!;

my $windowSize = 21;
my $stepSize   = 1;

for ( my $windowStart = 0 ; $windowStart <= ( length ( $$sequenceRef ) - $windowSize ); $windowStart += $stepSize ) {
    my $windowSeq = substr ( $$sequenceRef, $windowStart, $windowSize );
    if ($windowSeq =~ /([ATCG]{10}GG$)/) {
        $windowSeqScore{$windowSeq}++;
    }
}

my $count = 0;
for (keys %windowSeqScore) {
    $count ++;
    if ($windowSeqScore{$_} == 1 ) {
        print KMERS ">crispr_$count", "\n", $_, "\n";
    }
}

sub loadSequence {
    my ($sequenceFile) = @_;
    my $sequence = "";
    unless ( open( FASTA, "<", $sequenceFile ) ) {
        die $!;
    }
    while (<FASTA>) {
        my $line = $_;
        chomp ($line);
        if ($line !~ /^>/ ) {
            $sequence .= $line;
        }
    }
    return \$sequence;
}

This is some of the output I'm getting:

>crispr_1
ACAACAATAATGCGACGATGG
>crispr_2
TCCGAAGTCTGCCACTTTAGG
>crispr_3
TGATTCCCGATGCAGTGGGGG
>crispr_4
GTGGGACGACTGGACAAGTGG
>crispr_5
GCCGAAGGAACAACACACAGG
>crispr_7
CAAAGTCACTGTCTACGCAGG
>crispr_8
ATCATTTGCTACCAGAAATGG
>crispr_9
ATCCTGCCTGGCAGCCGGAGG
>crispr_10
CCCTTGATCATGATAAATGGG
>crispr_11
AACAACTAACTCATTTTGTGG
>crispr_12
TTCCCAGCGGGGAAAAAATGG
>crispr_13
TCAAGAAAGATTTCCAAAAGG
>crispr_14
CCATGCGAGAAATCGCGCAGG
>crispr_15
GCTGCTCAAACTGGAACTTGG

This is some of the output I should be getting:

>crispr_1
TTTAGACTCCCCTTGTACAGG
>crispr_2
TCTTCAGTCTCCAGTCTCCGG
>crispr_3
TTGCGTTGCGGAGCATACTGG
>crispr_4
TGCCACCAGTGGTTCCAAGGG
>crispr_5
TTATGTTTGTACGAGGGGGGG
>crispr_6
TCTCTTTGGTTTACGGATGGG
>crispr_7
TTGGCAAGGAGACGGTCCTGG
>crispr_8
TGAATTAAAGCTTGCGCGAGG
>crispr_9
GGAAGAGGCATCAACGAGGGG
>crispr_10
TGCAGCGGCCTAACAAGGCGG
>crispr_11
CTGCCCGATCCTAACTCCAGG
>crispr_12
ATATATGTTTGACCGTCGGGG
>crispr_13
GGAAACAAAAGCCTATGCGGG
>crispr_14
TGCGATCAGGTGTATCCGAGG
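
A possible explanation, offered as a sketch rather than a definitive fix: the hash in the program above is keyed on the whole 21-nt window, so a window counts as unique even when its last 12 nt occur elsewhere in the genome behind a different 9-nt prefix. The regex also leaves the first 9 characters of the window unchecked, so they may contain N or other non-ACGT characters. The version below keys the count on the last 12 nt instead. The helper name unique_seed_kmers and the toy sequence are hypothetical (they are not from the original post). It assumes an uppercase sequence and scans only the forward strand, which may or may not be what the assignment expects.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns a reference to
# the list of windows that end in GG and whose final $seed_size nt are seen
# in exactly one such window of $$seq_ref.
sub unique_seed_kmers {
    my ($seq_ref, $window_size, $seed_size) = @_;
    my %seed_count;     # seed (last 12 nt) => number of windows ending in it
    my %seed_window;    # seed => first full window seen with that seed
    my @order;          # seeds in order of first appearance

    my $last_start = length($$seq_ref) - $window_size;
    for my $start (0 .. $last_start) {
        my $window = substr($$seq_ref, $start, $window_size);

        # Assumption: the whole window must be A/C/G/T and must end in GG.
        next unless $window =~ /^[ACGT]+GG$/;

        my $seed = substr($window, -$seed_size);
        push @order, $seed unless $seed_count{$seed}++;
        $seed_window{$seed} //= $window;
    }

    # Keep only the windows whose seed was seen exactly once.
    return [ map { $seed_window{$_} } grep { $seed_count{$_} == 1 } @order ];
}

# Toy sequence: the first two 21-mers share the seed CCCCCCCCCCGG,
# so only the third 21-mer should be reported.
my $toy = 'AAAAAAAAACCCCCCCCCCGG'
        . 'TTTTTTTTTCCCCCCCCCCGG'
        . 'ACGTACGTAACGTACGTTAGG';

my $n = 0;
for my $kmer (@{ unique_seed_kmers(\$toy, 21, 12) }) {
    $n++;    # number only the k-mers that are printed
    print ">crispr_$n\n$kmer\n";
}
```

On the toy sequence this prints a single record, >crispr_1 followed by ACGTACGTAACGTACGTTAGG, whereas keying on the full 21-mer would report all three windows. Incrementing the counter only when a record is printed also avoids gaps in the numbering, such as the missing crispr_6 in the output shown above.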