Re: unique sequences
by kcott (Archbishop) on Dec 11, 2017 at 07:31 UTC
|
It's impossible to tell why your actual output is wrong and your expected output is right.
Please read "How do I post a question effectively?" for guidelines on what information to provide.
There doesn't appear to be anything fundamentally wrong with you regex.
However, you're giving the regex engine more work to do with the character class "[ATCG]":
those are the only characters that will appear in a DNA sequence, so just "." would be sufficient
(and the regex engine doesn't need check if every character exists in the provided list).
With the missing information in your post — particularly the absence of input data —
it's difficult to know where you've gone wrong.
I suspect concatenating all the sequences is probably a bad idea;
instead, considering dealing with each sequence in turn.
The following script, "pm_1205271_bio_fasta_extract.pl",
is intended to provide some techniques you may find useful.
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
my ($fasta_in, $fasta_out) = qw{
pm_1205271_bio_fasta_extract_dummy_in.fasta
pm_1205271_bio_fasta_extract_dummy_out.fasta
};
my $re = qr[(.{10}GG)\z];
{
open my $in_fh, '<', $fasta_in;
open my $out_fh, '>', $fasta_out;
local $/ = "\n>";
while (<$in_fh>) {
chomp;
substr $_, 0, 1, '' if $. == 1;
my ($head, $seq) = split /\n/;
next unless $seq =~ $re;
next if index($seq, $1) < length($seq) - length($1);
print $out_fh ">$head\n$seq\n";
}
}
Here's a sample run with some dummy data:
$ cat pm_1205271_bio_fasta_extract_dummy_in.fasta
>1: not wanted - too short
xxxGG
>2: wanted - exact match
xxxxxxxxxxGG
>3: not wanted - not unique
xxxxxxxxxxGGxxxxxxxxxxGG
>4: wanted - unique (but only just)
xxxxxxxxxGGxxxxxxxxxxGG
>5: not wanted - no GG at end
xxxxxxxxxxGGx
>6: not wanted - no match at all
xxxxxxxxxxAA
>7: wanted - match unique
yyyyyyyyyyGGxxxxxxxxxxGG
$ pm_1205271_bio_fasta_extract.pl
$ cat pm_1205271_bio_fasta_extract_dummy_out.fasta
>2: wanted - exact match
xxxxxxxxxxGG
>4: wanted - unique (but only just)
xxxxxxxxxGGxxxxxxxxxxGG
>7: wanted - match unique
yyyyyyyyyyGGxxxxxxxxxxGG
See also: autodie;
perlre
(you may want to follow the links at the start of that to introductory material);
perlvar (for "$/" and "$.");
and, Perl functions A-Z (for any functions you're unfamiliar with).
I'd also recommend you use lexical filehandles (see open);
using package variables can result in bugs which are hard to track down
(especially when you start writing code that's more substantial than a short assignment script).
| [reply] [Watch: Dir/Any] [d/l] [select] |
|
I'd say the best place for [AGCT] regex check is as close as possible to reading the input file. The less unvalidated data inside the perimeter, the better.
| [reply] [Watch: Dir/Any] |
|
In general, I'd agree with that.
However, biological data can be huge and the least amount of processing you can get away with, the better.
I'd probably recommend validating the fasta file once, then setting it to "read-only".
Perhaps additional checks to ensure it hasn't changed since validation might be in order.
It can then be used multiple times with some reasonable degree of confidence about the data integrity.
Obviously, at this point, we don't know the source of the input, or even what it looks like,
so validation requirements are purely guesswork.
| [reply] [Watch: Dir/Any] |
Re: unique sequences
by Dallaylaen (Chaplain) on Dec 11, 2017 at 01:42 UTC
|
It's hard to tell from the source what the input looks like, and what should be done...
Do I understand it right that
1) Input has [ACGT] sequences + some comment lines starting with >?
2) You need to find strings of 12 nucleotides ending in a GG that occur in the input only once?
It would be nice if you posted a short input sample (although I guess it's may not generate the needed discrepancy).
| [reply] [Watch: Dir/Any] |
|
These kind of dmel-all-chromosome files can be found here:
ftp://ftp.flybase.net/releases/current/dmel_r6.18/fasta/
dmel-all-chromosome-r6.18.fasta.gz
That's a few releases later but I suppose that doesn't matter for the processing solution. (The older releases will be somewhere there as well.)
| [reply] [Watch: Dir/Any] |
|
| [reply] [Watch: Dir/Any] |
Re: unique sequences
by Laurent_R (Canon) on Dec 11, 2017 at 07:19 UTC
|
Hi Anonymous Monk,
it is difficult to guess what you should be obtaining without seeing the input, but the output get is in line with the code you've shown. Your code is basically discarding the "comments" (that's how I call the lines starting with >, for lack of a better description) and then looks for sequences of ten nucleotides (I hope this is the right term) followed by GG. And that's pretty much what you have in your output. So, to me, you get what you ask for.
Please explain in plain English what you need to extract and in which respect the output you get is not what you want or need.
As a side note, it may or may not be relevant or important, but please remember that a hash does not preserve the order in which the data were populated into it.
| [reply] [Watch: Dir/Any] |
|
The name of the file-handle suggests that the input is in FASTA format. The reference article indicates that a file may contain more than one sequence. Each sequence is prefixed with a '>' line which specifies its name and may contain comments. If the input contains more than one sequence, the script combines them. If the file is known to contain only one sequence, it is overkill to test every line for '>'.
| [reply] [Watch: Dir/Any] |
Re: unique sequences
by jwkrahn (Abbot) on Dec 11, 2017 at 09:24 UTC
|
#!/usr/bin/perl
use strict;
use warnings;
my $input_file = '/scratch/Drosophila/dmel-all-chromosome-r6.02.fasta
+';
my $output_file = 'unique12KmersEndingGG.fasta';
open my $FASTA, '<', $input_file or die "Cannot open '$input_file' be
+cause: $!";
open my $KMERS, '>', $output_file or die "Cannot open '$output_file' b
+ecause: $!";
my ( $count, %unique_data );
while ( my $line = <$FASTA> ) {
next if $line =~ /^>/;
while ( $line =~ / ( .{9} [ATCG]{10} G \K G ) /gsx ) {
print $KMERS '>crispr_', ++$count "\n$1\n" unless $unique_data
+{ $1 }++;
}
}
| [reply] [Watch: Dir/Any] [d/l] |
|
c:\@Work\Perl>perl -wMstrict -MData::Dump -le
"my $line =
'AAAATTTTCCCCGGGGAAAGGxAAAACCCCTTTTGGGGAAAGGxTTTTAAAACCCCGGGGAAAGG'
;
my %unique_data;
my $count;
while ( $line =~ / ( .{9} [ATCG]{10} G \K G ) /gsxp ) {
print qq{>crispr_@{[ ++$count ]} '$1' ($&) (${^MATCH})}
unless $unique_data{$1}++;
}
;;
dd \%unique_data;
"
>crispr_1 'AAAATTTTCCCCGGGGAAAGG' (G) (G)
>crispr_2 'AAAACCCCTTTTGGGGAAAGG' (G) (G)
>crispr_3 'TTTTAAAACCCCGGGGAAAGG' (G) (G)
{
AAAACCCCTTTTGGGGAAAGG => 1,
AAAATTTTCCCCGGGGAAAGG => 1,
TTTTAAAACCCCGGGGAAAGG => 1,
}
Can you please give me some insight on this? Is \K simply a leftover from a previous version of the code?
Give a man a fish: <%-{-{-{-<
| [reply] [Watch: Dir/Any] [d/l] [select] |
|
Sorry, I didn't test the code well enough. I thought the capturing parenthesis would capture the look-behind pattern as well. This code will do what I originally intended:
Updated (thanks Cristoforo)
while ( $line =~ / (?<= .{9} [ATCG]{10} G ) G /gsx ) {
my $match = substr $line, $+[ 0 ] - 21, 21;
print $KMERS '>crispr_', ++$count, "\n$match\n"
unless $unique_data{ $match }++;
}
| [reply] [Watch: Dir/Any] [d/l] |
|
|
Re: unique sequences
by Marshall (Canon) on Dec 12, 2017 at 00:51 UTC
|
I don't see any need to create a hash table.
Why do you think a hash table is necessary?
Yes, I think your regex is flawed because of this substr "window". What's up with that?
I downloaded your data file and I present one way to parse it into records.
The first record is so long that I omitted it.
Please take one of these records and show us the desired output.
| [reply] [Watch: Dir/Any] [d/l] |
Re: unique sequences
by Cristoforo (Curate) on Dec 12, 2017 at 21:35 UTC
|
Hello,
I used BioPerl, Bio::SeqIO to load the fasta sequences all into a long string. Then I tested for kmers that only appeared once and printed them out.
The results agree with your desired output.
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;
my $in = Bio::SeqIO->new( -file => "dmel-all-chromosome-r6.18.fasta
+" ,
-format => 'fasta');
my $fasta;
while ( my $seq = $in->next_seq() ) {
$fasta .= $seq->seq;
}
my %seen;
for my $i (0 .. length($fasta) - 21) {
my $kmer = substr $fasta, $i, 21;
next unless substr($kmer, -2) eq 'GG';
my $match = substr($kmer, -12);
$seen{$match}{count}++;
$seen{$match}{kmer} = $kmer;
}
my $crispr;
for my $key (keys %seen) {
next unless $seen{$key}{count} == 1;
print "crispr_", ++$crispr, "\n";
print $seen{$key}{kmer}, "\n";
}
__END__
*** ouput
crispr_1
TTTAGACTCCCCTTGTACAGG
crispr_2
TCTTCAGTCTCCAGTCTCCGG
crispr_3
TTGCGTTGCGGAGCATACTGG
crispr_4
TGCCACCAGTGGTTCCAAGGG
crispr_5
TTATGTTTGTACGAGGGGGGG
crispr_6
TCTCTTTGGTTTACGGATGGG
crispr_7
TTGGCAAGGAGACGGTCCTGG
crispr_8
TGAATTAAAGCTTGCGCGAGG
crispr_9
GGAAGAGGCATCAACGAGGGG
crispr_10
TGCAGCGGCCTAACAAGGCGG
crispr_11
CTGCCCGATCCTAACTCCAGG
crispr_12
ATATATGTTTGACCGTCGGGG
...
crispr_126892
TTGCTTGGCACTAAGGCGGGG
crispr_126893
CACCAAAAAGGACTTGCGTGG
crispr_126894
GTGCCCCTCACTCATGCGGGG
crispr_126895
TAAAAAGCGACGCAGTATTGG
crispr_126896
CCGGATTTCTTCGTACAGGGG
crispr_126897
GGTGGCTATGCTATGGTACGG
crispr_126898
CTGCGTTGATGTTAGGTAGGG
crispr_126899
GCTGGGACCCGAATACGTAGG
This program runs in under 2 minutes with the string of fasta characters about 145M. From your specs it looks like you want to combine the sequences into one string to test for kmers that only occur 1 time.
Update: Just realized how odd it was for my results to agree with the ones you posted. The order of my results were in the (un)ordered keys from the hash. It is curious why the order I got agreed with the order in your post!
Also, I think the reason you were getting 1 million results rather than 250,000 is you are testing for uniqueness on the whole 21 char windowsize instead of testing just the last 12 chars (including the 'GG' ending). There would be more unique 21 char kmers than unique 12 char kmers. | [reply] [Watch: Dir/Any] [d/l] |
Re: unique sequences
by Anonymous Monk on Dec 12, 2017 at 04:09 UTC
|
| [reply] [Watch: Dir/Any] |
Re: unique sequences
by james28909 (Deacon) on Dec 11, 2017 at 02:47 UTC
|
oh god please tell me crispr isnt run by a random perl noob :P | [reply] [Watch: Dir/Any] |
|
| [reply] [Watch: Dir/Any] |