Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

unique sequences

by Anonymous Monk
on Dec 10, 2017 at 23:31 UTC ( [id://1205271]=perlquestion: print w/replies, xml ) Need Help??

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am really new to perl and am taking a course on it. I wrote the following program for an assignment and am getting the incorrect output. I'm getting over a million lines while the expected output is closer to 250,000. The last 12 nts need to be unique to the genome. I have a feeling it's due to my regex. Any advice would be greatly appreciated. Thankyou.

#!/usr/bin/perl use strict; use warnings; my %windowSeqScore = (); my $input_file = '/scratch/Drosophila/dmel-all-chromosome-r6.02.fasta' +; my $sequenceRef = loadSequence($input_file); my $output_file = 'unique12KmersEndingGG.fasta'; open (KMERS,">", $output_file) or die $!; my $windowSize = 21; my $stepSize = 1; for ( my $windowStart = 0 ; $windowStart <= ( length ( $$sequenceRef ) + - $windowSize ); $windowStart += $stepSize ) { my $windowSeq = substr ( $$sequenceRef, $windowStart, $windowS +ize); if ($windowSeq =~ /([ATCG]{10}GG$)/) { $windowSeqScore{$windowSeq}++; } } my $count = 0; for (keys %windowSeqScore){ $count ++; if ($windowSeqScore{$_} == 1 ) { print KMERS ">crispr_$count", "\n", $_, "\n"; } } sub loadSequence { my ($sequenceFile) = @_; my $sequence = ""; unless ( open( FASTA, "<", $sequenceFile ) ) { die $!; } while (<FASTA>){ my $line = $_; chomp ($line); if ($line !~ /^>/ ) { $sequence .= $line; } } return \$sequence; }

This is some of the output I'm getting

>crispr_1 ACAACAATAATGCGACGATGG >crispr_2 TCCGAAGTCTGCCACTTTAGG >crispr_3 TGATTCCCGATGCAGTGGGGG >crispr_4 GTGGGACGACTGGACAAGTGG >crispr_5 GCCGAAGGAACAACACACAGG >crispr_7 CAAAGTCACTGTCTACGCAGG >crispr_8 ATCATTTGCTACCAGAAATGG >crispr_9 ATCCTGCCTGGCAGCCGGAGG >crispr_10 CCCTTGATCATGATAAATGGG >crispr_11 AACAACTAACTCATTTTGTGG >crispr_12 TTCCCAGCGGGGAAAAAATGG >crispr_13 TCAAGAAAGATTTCCAAAAGG >crispr_14 CCATGCGAGAAATCGCGCAGG >crispr_15 GCTGCTCAAACTGGAACTTGG

this is some of the expected output I should be getting

>crispr_1 TTTAGACTCCCCTTGTACAGG >crispr_2 TCTTCAGTCTCCAGTCTCCGG >crispr_3 TTGCGTTGCGGAGCATACTGG >crispr_4 TGCCACCAGTGGTTCCAAGGG >crispr_5 TTATGTTTGTACGAGGGGGGG >crispr_6 TCTCTTTGGTTTACGGATGGG >crispr_7 TTGGCAAGGAGACGGTCCTGG >crispr_8 TGAATTAAAGCTTGCGCGAGG >crispr_9 GGAAGAGGCATCAACGAGGGG >crispr_10 TGCAGCGGCCTAACAAGGCGG >crispr_11 CTGCCCGATCCTAACTCCAGG >crispr_12 ATATATGTTTGACCGTCGGGG >crispr_13 GGAAACAAAAGCCTATGCGGG >crispr_14 TGCGATCAGGTGTATCCGAGG

Replies are listed 'Best First'.
Re: unique sequences
by kcott (Archbishop) on Dec 11, 2017 at 07:31 UTC

    It's impossible to tell why your actual output is wrong and your expected output is right. Please read "How do I post a question effectively?" for guidelines on what information to provide.

    There doesn't appear to be anything fundamentally wrong with you regex. However, you're giving the regex engine more work to do with the character class "[ATCG]": those are the only characters that will appear in a DNA sequence, so just "." would be sufficient (and the regex engine doesn't need check if every character exists in the provided list).

    With the missing information in your post — particularly the absence of input data — it's difficult to know where you've gone wrong. I suspect concatenating all the sequences is probably a bad idea; instead, considering dealing with each sequence in turn.

    The following script, "pm_1205271_bio_fasta_extract.pl", is intended to provide some techniques you may find useful.

    #!/usr/bin/env perl use strict; use warnings; use autodie; my ($fasta_in, $fasta_out) = qw{ pm_1205271_bio_fasta_extract_dummy_in.fasta pm_1205271_bio_fasta_extract_dummy_out.fasta }; my $re = qr[(.{10}GG)\z]; { open my $in_fh, '<', $fasta_in; open my $out_fh, '>', $fasta_out; local $/ = "\n>"; while (<$in_fh>) { chomp; substr $_, 0, 1, '' if $. == 1; my ($head, $seq) = split /\n/; next unless $seq =~ $re; next if index($seq, $1) < length($seq) - length($1); print $out_fh ">$head\n$seq\n"; } }

    Here's a sample run with some dummy data:

    $ cat pm_1205271_bio_fasta_extract_dummy_in.fasta >1: not wanted - too short xxxGG >2: wanted - exact match xxxxxxxxxxGG >3: not wanted - not unique xxxxxxxxxxGGxxxxxxxxxxGG >4: wanted - unique (but only just) xxxxxxxxxGGxxxxxxxxxxGG >5: not wanted - no GG at end xxxxxxxxxxGGx >6: not wanted - no match at all xxxxxxxxxxAA >7: wanted - match unique yyyyyyyyyyGGxxxxxxxxxxGG $ pm_1205271_bio_fasta_extract.pl $ cat pm_1205271_bio_fasta_extract_dummy_out.fasta >2: wanted - exact match xxxxxxxxxxGG >4: wanted - unique (but only just) xxxxxxxxxGGxxxxxxxxxxGG >7: wanted - match unique yyyyyyyyyyGGxxxxxxxxxxGG

    See also: autodie; perlre (you may want to follow the links at the start of that to introductory material); perlvar (for "$/" and "$."); and, Perl functions A-Z (for any functions you're unfamiliar with).

    I'd also recommend you use lexical filehandles (see open); using package variables can result in bugs which are hard to track down (especially when you start writing code that's more substantial than a short assignment script).

    — Ken

      I'd say the best place for [AGCT] regex check is as close as possible to reading the input file. The less unvalidated data inside the perimeter, the better.

        In general, I'd agree with that. However, biological data can be huge and the least amount of processing you can get away with, the better.

        I'd probably recommend validating the fasta file once, then setting it to "read-only". Perhaps additional checks to ensure it hasn't changed since validation might be in order. It can then be used multiple times with some reasonable degree of confidence about the data integrity.

        Obviously, at this point, we don't know the source of the input, or even what it looks like, so validation requirements are purely guesswork.

        — Ken

Re: unique sequences
by Dallaylaen (Chaplain) on Dec 11, 2017 at 01:42 UTC

    It's hard to tell from the source what the input looks like, and what should be done...

    Do I understand it right that
    1) Input has [ACGT] sequences + some comment lines starting with >?
    2) You need to find strings of 12 nucleotides ending in a GG that occur in the input only once?

    It would be nice if you posted a short input sample (although I guess it's may not generate the needed discrepancy).

      These kind of dmel-all-chromosome files can be found here:

      ftp://ftp.flybase.net/releases/current/dmel_r6.18/fasta/

      dmel-all-chromosome-r6.18.fasta.gz

      That's a few releases later but I suppose that doesn't matter for the processing solution. (The older releases will be somewhere there as well.)

        That's a few releases later...

        So basically species have releases, and are versioned these days. I for one would love to git revert a mammoth!

Re: unique sequences
by Laurent_R (Canon) on Dec 11, 2017 at 07:19 UTC
    Hi Anonymous Monk,

    it is difficult to guess what you should be obtaining without seeing the input, but the output get is in line with the code you've shown. Your code is basically discarding the "comments" (that's how I call the lines starting with >, for lack of a better description) and then looks for sequences of ten nucleotides (I hope this is the right term) followed by GG. And that's pretty much what you have in your output. So, to me, you get what you ask for.

    Please explain in plain English what you need to extract and in which respect the output you get is not what you want or need.

    As a side note, it may or may not be relevant or important, but please remember that a hash does not preserve the order in which the data were populated into it.

      The name of the file-handle suggests that the input is in FASTA format. The reference article indicates that a file may contain more than one sequence. Each sequence is prefixed with a '>' line which specifies its name and may contain comments. If the input contains more than one sequence, the script combines them. If the file is known to contain only one sequence, it is overkill to test every line for '>'.
      Bill
Re: unique sequences
by jwkrahn (Abbot) on Dec 11, 2017 at 09:24 UTC

    It looks like you need something like this:

    #!/usr/bin/perl use strict; use warnings; my $input_file = '/scratch/Drosophila/dmel-all-chromosome-r6.02.fasta +'; my $output_file = 'unique12KmersEndingGG.fasta'; open my $FASTA, '<', $input_file or die "Cannot open '$input_file' be +cause: $!"; open my $KMERS, '>', $output_file or die "Cannot open '$output_file' b +ecause: $!"; my ( $count, %unique_data ); while ( my $line = <$FASTA> ) { next if $line =~ /^>/; while ( $line =~ / ( .{9} [ATCG]{10} G \K G ) /gsx ) { print $KMERS '>crispr_', ++$count "\n$1\n" unless $unique_data +{ $1 }++; } }

      I don't understand the purpose of the  \K operator in the
          / ( .{9} [ATCG]{10} G \K G ) /gsx
      regex in your code. It doesn't seem to have any effect except WRT $& and ${^MATCH}, which aren't used in the code:

      c:\@Work\Perl>perl -wMstrict -MData::Dump -le "my $line = 'AAAATTTTCCCCGGGGAAAGGxAAAACCCCTTTTGGGGAAAGGxTTTTAAAACCCCGGGGAAAGG' ; my %unique_data; my $count; while ( $line =~ / ( .{9} [ATCG]{10} G \K G ) /gsxp ) { print qq{>crispr_@{[ ++$count ]} '$1' ($&) (${^MATCH})} unless $unique_data{$1}++; } ;; dd \%unique_data; " >crispr_1 'AAAATTTTCCCCGGGGAAAGG' (G) (G) >crispr_2 'AAAACCCCTTTTGGGGAAAGG' (G) (G) >crispr_3 'TTTTAAAACCCCGGGGAAAGG' (G) (G) { AAAACCCCTTTTGGGGAAAGG => 1, AAAATTTTCCCCGGGGAAAGG => 1, TTTTAAAACCCCGGGGAAAGG => 1, }
      Can you please give me some insight on this? Is  \K simply a leftover from a previous version of the code?


      Give a man a fish:  <%-{-{-{-<

        Sorry, I didn't test the code well enough. I thought the capturing parenthesis would capture the look-behind pattern as well. This code will do what I originally intended:

        Updated (thanks Cristoforo)

        while ( $line =~ / (?<= .{9} [ATCG]{10} G ) G /gsx ) { my $match = substr $line, $+[ 0 ] - 21, 21; print $KMERS '>crispr_', ++$count, "\n$match\n" unless $unique_data{ $match }++; }
Re: unique sequences
by Marshall (Canon) on Dec 12, 2017 at 00:51 UTC
    I don't see any need to create a hash table.
    Why do you think a hash table is necessary?
    Yes, I think your regex is flawed because of this substr "window". What's up with that?

    I downloaded your data file and I present one way to parse it into records.
    The first record is so long that I omitted it.

    Please take one of these records and show us the desired output.

Re: unique sequences
by Cristoforo (Curate) on Dec 12, 2017 at 21:35 UTC
    Hello,

    I used BioPerl, Bio::SeqIO to load the fasta sequences all into a long string. Then I tested for kmers that only appeared once and printed them out.

    The results agree with your desired output.

    #!/usr/bin/perl use strict; use warnings; use Bio::SeqIO; my $in = Bio::SeqIO->new( -file => "dmel-all-chromosome-r6.18.fasta +" , -format => 'fasta'); my $fasta; while ( my $seq = $in->next_seq() ) { $fasta .= $seq->seq; } my %seen; for my $i (0 .. length($fasta) - 21) { my $kmer = substr $fasta, $i, 21; next unless substr($kmer, -2) eq 'GG'; my $match = substr($kmer, -12); $seen{$match}{count}++; $seen{$match}{kmer} = $kmer; } my $crispr; for my $key (keys %seen) { next unless $seen{$key}{count} == 1; print "crispr_", ++$crispr, "\n"; print $seen{$key}{kmer}, "\n"; } __END__ *** ouput crispr_1 TTTAGACTCCCCTTGTACAGG crispr_2 TCTTCAGTCTCCAGTCTCCGG crispr_3 TTGCGTTGCGGAGCATACTGG crispr_4 TGCCACCAGTGGTTCCAAGGG crispr_5 TTATGTTTGTACGAGGGGGGG crispr_6 TCTCTTTGGTTTACGGATGGG crispr_7 TTGGCAAGGAGACGGTCCTGG crispr_8 TGAATTAAAGCTTGCGCGAGG crispr_9 GGAAGAGGCATCAACGAGGGG crispr_10 TGCAGCGGCCTAACAAGGCGG crispr_11 CTGCCCGATCCTAACTCCAGG crispr_12 ATATATGTTTGACCGTCGGGG ... crispr_126892 TTGCTTGGCACTAAGGCGGGG crispr_126893 CACCAAAAAGGACTTGCGTGG crispr_126894 GTGCCCCTCACTCATGCGGGG crispr_126895 TAAAAAGCGACGCAGTATTGG crispr_126896 CCGGATTTCTTCGTACAGGGG crispr_126897 GGTGGCTATGCTATGGTACGG crispr_126898 CTGCGTTGATGTTAGGTAGGG crispr_126899 GCTGGGACCCGAATACGTAGG

    This program runs in under 2 minutes with the string of fasta characters about 145M. From your specs it looks like you want to combine the sequences into one string to test for kmers that only occur 1 time.

    Update: Just realized how odd it was for my results to agree with the ones you posted. The order of my results were in the (un)ordered keys from the hash. It is curious why the order I got agreed with the order in your post!

    Also, I think the reason you were getting 1 million results rather than 250,000 is you are testing for uniqueness on the whole 21 char windowsize instead of testing just the last 12 chars (including the 'GG' ending). There would be more unique 21 char kmers than unique 12 char kmers.

Re: unique sequences
by Anonymous Monk on Dec 12, 2017 at 04:09 UTC
Re: unique sequences
by james28909 (Deacon) on Dec 11, 2017 at 02:47 UTC
    oh god please tell me crispr isnt run by a random perl noob :P
      Yup. They are going to kill us all. I mean, they are hacking a software they don't understand with tools they hardly understand and without knowing the programming language. What could possibly go wrong?


      holli

      You can lead your users to water, but alas, you cannot drown them.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1205271]
Approved by Athanasius
Front-paged by kcott
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others rifling through the Monastery: (4)
As of 2024-03-28 21:26 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found