bioinformatics has asked for the wisdom of the Perl Monks concerning the following question:
Hello all,
The NCBI database, which contains sequence information for hundreds of species, allows you to download major portions of the database so that sequence searches can be run locally rather than having too many people hog the bandwidth. I have downloaded a specific flat file containing sequence information in FASTA format, which looks like this:
>gi|2695846|emb|Y13255.1|ABY13255 Acipenser baeri mRNA for immunoglobulin heavy chain, clone ScH 3.3
TGGTTACAACACTTTCTTCTTTCAATAACCACAATACTGCAGTACAATGGGGATTTTAACAGCTCTCTGTATAATAATGA
CAGCTCTATCAAGTGTCCGGTCTGATGTAGTGTTGACTGAGTCCGGACCAGCAGTTATAAAGCCTGGAGAGTCCCATAAA
CTGTCCTGTAAAGCCTCTGGATTCACATTCAGCAGCGCCTACATGAGCTGGGTTCGACAAGCTCCTGGAAAGGGTCTGGA
ATGGGTGGCTTATATTTACTCAGGTGGTAGTAGTACATACTATGCCCAGTCTGTCCAGGGAAGATTCGCCATCTCCAGAG
ACGATTCCAACAGCATGCTGTATTTACAAATGAACAGCCTGAAGACTGAAGACACTGCCGTGTATTACTGTGCTCGGGGC
GGGCTGGGGTGGTCCCTTGACTACTGGGGGAAAGGCACAATGATCACCGTAACTTCTGCTACGCCATCACCACCGACAGT
GTTTCCGCTTATGGAGTCATGTTGTTTGAGCGATATCTCGGGTCCTGTTGCTACGGGCTGCTTAGCAACCGGATTCTGCC
TACCCCCGCGACCTTCTCGTGGACTGATCAATCTGGAAAAGCTTTT
This flat file is composed of consecutive FASTA sequences (like the one above) and approaches 9 GB in size. What I am trying to do is slurp the entire file into an array (crazy, but my Unix system can handle it :-) and then parse the individual sequences out into sub-arrays. Because the sequences vary in length, I can't use @subarray = splice(@list, 0, 11); to pull out each one; I have to separate the sequences based on the > symbol. What would be the simplest way to say "slurp in the multiple lines of data between one > and the next and place them into subarray X"? My main worry is coding this so that Perl keeps its place along the way, so I don't get the same sequence pulled out 20,000 times rather than 20,000 different sequence arrays. If there are any suggestions on making this as painless a memory hog as possible, I would greatly appreciate them as well. I apologize if this seems a dumb question, but I'm a self-taught Perl hacker and still pretty new, though playing with some nice Unix toys....
Re: Reformatting a large fasta file...
by Itatsumaki (Friar) on Nov 19, 2003 at 00:29 UTC
I wouldn't do it this way if I were you! :)
From the size it looks like you're thinking of parsing nt, which is going to be slow no matter what. The sooner you start getting used to BioPerl, the better off you'll be: with Bio::SeqIO this is a handful of lines. Further, the tired/tireless developers there are always looking for ways to speed up their parsers. It's a real friendly mailing list, too (bioperl-l@bioperl.org).
use Bio::SeqIO;

# Note: Bio::SeqIO->new takes a list of named arguments, not a hash ref
my $seqIO = Bio::SeqIO->new(-file => 'nt', -format => 'fasta');
my $i = 0;
while (my $seq = $seqIO->next_seq()) {
    $i++;
}
print "There are $i non-redundant nucleotide sequences\n";
The last thing any bioinformatician wants to do is rewrite parsers. There are dozens of sequence formats, and BioPerl lets you parse them all this way, just changing the -format to what you need. In fact, it even has some rudimentary format auto-identification, I believe. You can also write sequence objects back out with Bio::SeqIO, allowing for two- or three-line reformatters. It's an invaluable resource, and it isn't too hard to learn to use.
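For instance, a reformatter along those lines might look like the sketch below. This is illustrative only: the filenames and the 'genbank' input format are assumptions for the example, not something from this thread, and it obviously needs BioPerl installed.

```perl
use strict;
use warnings;
use Bio::SeqIO;

# Hypothetical filenames: read GenBank records, write them back out as FASTA.
my $in  = Bio::SeqIO->new(-file => 'input.gb',      -format => 'genbank');
my $out = Bio::SeqIO->new(-file => '>output.fasta', -format => 'fasta');

# The whole "reformatter" is this loop: pull sequence objects from one
# stream and push them into the other.
while (my $seq = $in->next_seq()) {
    $out->write_seq($seq);
}
```

Swapping the two -format values is all it takes to convert between any pair of formats Bio::SeqIO understands.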
BioPerl 1.2.3 is available from CPAN, or you can get a CVS tarball from the BioPerl project site.
This is a great suggestion. However, I have attempted to use the various BioPerl modules, but to no avail. Again, I'm still new to computer programming in general, and the lack of good documentation for the BioPerl modules has left me with little more than headaches :-). I'm working on your suggestion currently, and I also intend to join the mailing list. Thanks!!
### USAGE
# perl -w splitter.pl <filename> <seq-per-file>
# e.g. perl -w splitter.pl nt 20000

### INCLUDES
use strict;
use Bio::SeqIO;

### LOCALS
my $infile = $ARGV[0];
my $size   = $ARGV[1];
my $i      = 0;   # sequences written to the current output file
my $j      = 1;   # output file counter

### OBJECT INSTANTIATION
my $in = Bio::SeqIO->new(
    -file   => $infile,
    -format => 'Fasta',
);
my $out = Bio::SeqIO->new(
    -file   => ">${infile}_$j.fasta",
    -format => 'Fasta',
);

### PROCESS FILES
while ( my $seq = $in->next_seq() ) {
    # start a new output file once the current one is full
    # (checking before writing avoids an off-by-one that would
    # leave the first file one sequence short)
    if ( $i == $size ) {
        $j++;
        $out = Bio::SeqIO->new(
            -file   => ">${infile}_$j.fasta",
            -format => 'Fasta',
        );
        $i = 0;
    }
    $out->write_seq($seq);
    $i++;
}
Re: Reformatting a large fasta file...
by Roger (Parson) on Nov 18, 2003 at 23:28 UTC
Yes, there is a trick: you can localize the '$/' variable, set it to '>', and loop through your data one record at a time. I haven't seen your actual data, but I made a few assumptions and came up with the code below, which I believe is memory-efficient.
use strict;
use Data::Dumper;

my @seq;    # array of sequences
my %seq;    # hash lookup of sequences based on id's
{
    local $/ = '>';
    while (<DATA>) {
        s/^>//;                 # strip '>' from the start of the record
        s/>$//;                 # and from the end
        next if !length($_);    # ignore the empty first record
        my ($header_info) = /^(.*)\n/;  # capture the header line
        s/^(.*)\n//;                    # and strip it out
        my @rec = split /\|/, $header_info;
        s/\n//mg;               # join the sequence lines
        push @rec, $_;          # make the sequence the last element
        push @seq, \@rec;       # push onto the array of sequences
        $seq{$rec[1]} = \@rec;  # or store in a hash keyed on the id
    }
}
print Dumper(\@seq);
print Dumper(\%seq);
__DATA__
>gi|2695845|emb|Y13255.1|ABY13255 Acipenser baeri mRNA for immunoglobulin heavy chain, clone ScH 3.3
TGGTTACAACACTTTCTTCTTTCAATAACCACAATACTGCAGTACAATGGGGATTTTAACAGCTCTCTGTATAATAATGA
TACCCCCGCGACCTTCTCGTGGACTGATCAATCTGGAAAAGCTTTT
>gi|2695846|emb|Y13255.1|ABY13255 Acipenser baeri mRNA for immunoglobulin heavy chain, clone ScH 3.3
TGGTTACAACACTTTCTTCTTTCAATAACCACAATACTGCAGTACAATGGGGATTTTAACAGCTCTCTGTATAATAATGA
TACCCCCGCGACCTTCTCGTGGACTGATCAATCTGGAAAAGCTTTT
Re: Reformatting a large fasta file...
by duff (Parson) on Nov 18, 2003 at 23:12 UTC
Change your input record separator ($/) to >:
{
    local $/ = ">";
    while (<FILE>) {
        ...
    }
}
When you change the record separator to ">", the first record will be empty (one of the later responses handles that case). More importantly, the last chunk of data lies beyond the last record separator, and there is no separator at the end of the file (EOF). Will that last chunk be picked up at all? (I'm not exactly sure how record separators and file reads behave for that final chunk.)
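For what it's worth, the last chunk does get picked up: readline returns whatever has accumulated since the last separator as a final, separator-less record when it hits EOF. A small self-contained check, using an in-memory filehandle and a made-up two-record string shaped like the FASTA data in this thread:

```perl
use strict;
use warnings;

# Made-up miniature FASTA data; note there is no '>' after the last record.
my $fasta = ">seq1\nACGT\n>seq2\nTTGG\n";
open my $fh, '<', \$fasta or die $!;

my @records;
{
    local $/ = '>';
    while (<$fh>) {
        chomp;               # chomp removes the trailing '>' (the current $/)
        next unless length;  # skip the empty record before the first '>'
        push @records, $_;
    }
}
close $fh;

# Both records come back, including the final chunk that is not
# terminated by the record separator.
print scalar(@records), "\n";   # prints 2
print $records[1];              # prints "seq2\nTTGG\n"
```

So the only special cases to handle are the leading empty record and the fact that the last record lacks the trailing '>' that chomp would otherwise remove.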
Re: Reformatting a large fasta file...
by BrowserUk (Patriarch) on Nov 18, 2003 at 23:39 UTC
#! perl -slw
use strict;
use Data::Dumper;

local $/ = '>';                 # Set record delimiter
my @sequences;                  # AoA

(undef) = scalar <DATA>;        # Discard first (blank) record.

while ( my $record = <DATA> ) { # Read each record
    # Split to lines
    my @lines = split "\n", $record;
    # Discard the trailing '>' delimiter if necessary
    pop @lines if $lines[-1] eq '>';
    # Build the array: header fields, then the joined sequence
    push @sequences, [ split( /\|/, shift @lines ), join( '', @lines ) ];
}

print Dumper \@sequences;
__DATA__
>gi|2695846|emb|Y13255.1|ABY13255 Acipenser baeri mRNA for immunoglobulin heavychain, clone ScH 3.3
TGGTTACAACACTTTCTTCTTTCAATAACCACAATACTGCAGTACAATGGGGATTTTAACAGCTCTCTGTATAATAATGA
CAGCTCTATCAAGTGTCCGGTCTGATGTAGTGTTGACTGAGTCCGGACCAGCAGTTATAAAGCCTGGAGAGTCCCATAAA
CTGTCCTGTAAAGCCTCTGGATTCACATTCAGCAGCGCCTACATGAGCTGGGTTCGACAAGCTCCTGGAAAGGGTCTGGA
ATGGGTGGCTTATATTTACTCAGGTGGTAGTAGTACATACTATGCCCAGTCTGTCCAGGGAAGATTCGCCATCTCCAGAG
ACGATTCCAACAGCATGCTGTATTTACAAATGAACAGCCTGAAGACTGAAGACACTGCCGTGTATTACTGTGCTCGGGGC
GGGCTGGGGTGGTCCCTTGACTACTGGGGGAAAGGCACAATGATCACCGTAACTTCTGCTACGCCATCACCACCGACAGT
GTTTCCGCTTATGGAGTCATGTTGTTTGAGCGATATCTCGGGTCCTGTTGCTACGGGCTGCTTAGCAACCGGATTCTGCC
TACCCCCGCGACCTTCTCGTGGACTGATCAATCTGGAAAAGCTTTT
>gi|2695846|emb|Y13255.1|ABY13255 Acipenser baeri mRNA for immunoglobulin heavychain, clone ScH 3.3
TGGTTACAACACTTTCTTCTTTCAATAACCACAATACTGCAGTACAATGGGGATTTTAACAGCTCTCTGTATAATAATGA
CAGCTCTATCAAGTGTCCGGTCTGATGTAGTGTTGACTGAGTCCGGACCAGCAGTTATAAAGCCTGGAGAGTCCCATAAA
CTGTCCTGTAAAGCCTCTGGATTCACATTCAGCAGCGCCTACATGAGCTGGGTTCGACAAGCTCCTGGAAAGGGTCTGGA
ATGGGTGGCTTATATTTACTCAGGTGGTAGTAGTACATACTATGCCCAGTCTGTCCAGGGAAGATTCGCCATCTCCAGAG
ACGATTCCAACAGCATGCTGTATTTACAAATGAACAGCCTGAAGACTGAAGACACTGCCGTGTATTACTGTGCTCGGGGC
GGGCTGGGGTGGTCCCTTGACTACTGGGGGAAAGGCACAATGATCACCGTAACTTCTGCTACGCCATCACCACCGACAGT
GTTTCCGCTTATGGAGTCATGTTGTTTGAGCGATATCTCGGGTCCTGTTGCTACGGGCTGCTTAGCAACCGGATTCTGCC
TACCCCCGCGACCTTCTCGTGGACTGATCAATCTGGAAAAGCTTTT
>gi|2695846|emb|Y13255.1|ABY13255 Acipenser baeri mRNA for immunoglobulin heavychain, clone ScH 3.3
TGGTTACAACACTTTCTTCTTTCAATAACCACAATACTGCAGTACAATGGGGATTTTAACAGCTCTCTGTATAATAATGA
CAGCTCTATCAAGTGTCCGGTCTGATGTAGTGTTGACTGAGTCCGGACCAGCAGTTATAAAGCCTGGAGAGTCCCATAAA
CTGTCCTGTAAAGCCTCTGGATTCACATTCAGCAGCGCCTACATGAGCTGGGTTCGACAAGCTCCTGGAAAGGGTCTGGA
ATGGGTGGCTTATATTTACTCAGGTGGTAGTAGTACATACTATGCCCAGTCTGTCCAGGGAAGATTCGCCATCTCCAGAG
ACGATTCCAACAGCATGCTGTATTTACAAATGAACAGCCTGAAGACTGAAGACACTGCCGTGTATTACTGTGCTCGGGGC
GGGCTGGGGTGGTCCCTTGACTACTGGGGGAAAGGCACAATGATCACCGTAACTTCTGCTACGCCATCACCACCGACAGT
GTTTCCGCTTATGGAGTCATGTTGTTTGAGCGATATCTCGGGTCCTGTTGCTACGGGCTGCTTAGCAACCGGATTCTGCC
TACCCCCGCGACCTTCTCGTGGACTGATCAATCTGGAAAAGCTTTT
>gi|2695846|emb|Y13255.1|ABY13255 Acipenser baeri mRNA for immunoglobulin heavychain, clone ScH 3.3
TGGTTACAACACTTTCTTCTTTCAATAACCACAATACTGCAGTACAATGGGGATTTTAACAGCTCTCTGTATAATAATGA
CAGCTCTATCAAGTGTCCGGTCTGATGTAGTGTTGACTGAGTCCGGACCAGCAGTTATAAAGCCTGGAGAGTCCCATAAA
CTGTCCTGTAAAGCCTCTGGATTCACATTCAGCAGCGCCTACATGAGCTGGGTTCGACAAGCTCCTGGAAAGGGTCTGGA
ATGGGTGGCTTATATTTACTCAGGTGGTAGTAGTACATACTATGCCCAGTCTGTCCAGGGAAGATTCGCCATCTCCAGAG
ACGATTCCAACAGCATGCTGTATTTACAAATGAACAGCCTGAAGACTGAAGACACTGCCGTGTATTACTGTGCTCGGGGC
GGGCTGGGGTGGTCCCTTGACTACTGGGGGAAAGGCACAATGATCACCGTAACTTCTGCTACGCCATCACCACCGACAGT
GTTTCCGCTTATGGAGTCATGTTGTTTGAGCGATATCTCGGGTCCTGTTGCTACGGGCTGCTTAGCAACCGGATTCTGCC
TACCCCCGCGACCTTCTCGTGGACTGATCAATCTGGAAAAGCTTTT
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
Re: Reformatting a large fasta file...
by Hena (Friar) on Nov 19, 2003 at 10:21 UTC
Although this is not a Perl answer, it's a good thing to point out when you're doing bioinformatics and have *nix available.
EMBOSS has a lot of commands for handling sequences (e.g., dreg, preg, and fuzz* for searching). BLAST and InterProScan, for example, can also be downloaded and used locally. And as Itatsumaki said, BioPerl can help out a lot.