PerlMonks
Reformatting a large fasta file...

by bioinformatics (Friar)
on Nov 18, 2003 at 23:03 UTC ( id://308170 )

bioinformatics has asked for the wisdom of the Perl Monks concerning the following question:

Hello all
The NCBI database, which contains sequence information for hundreds of species, allows major portions of the database to be downloaded, so that sequence searches can be run locally rather than having too many people hog the bandwidth. I have downloaded a specific flat file containing sequence information in FASTA format, which looks like this:
>gi|2695846|emb|Y13255.1|ABY13255 Acipenser baeri mRNA for immunoglobulin heavy chain, clone ScH 3.3
TGGTTACAACACTTTCTTCTTTCAATAACCACAATACTGCAGTACAATGGGGATTTTAACAGCTCTCTGTATAATAATGA
CAGCTCTATCAAGTGTCCGGTCTGATGTAGTGTTGACTGAGTCCGGACCAGCAGTTATAAAGCCTGGAGAGTCCCATAAA
CTGTCCTGTAAAGCCTCTGGATTCACATTCAGCAGCGCCTACATGAGCTGGGTTCGACAAGCTCCTGGAAAGGGTCTGGA
ATGGGTGGCTTATATTTACTCAGGTGGTAGTAGTACATACTATGCCCAGTCTGTCCAGGGAAGATTCGCCATCTCCAGAG
ACGATTCCAACAGCATGCTGTATTTACAAATGAACAGCCTGAAGACTGAAGACACTGCCGTGTATTACTGTGCTCGGGGC
GGGCTGGGGTGGTCCCTTGACTACTGGGGGAAAGGCACAATGATCACCGTAACTTCTGCTACGCCATCACCACCGACAGT
GTTTCCGCTTATGGAGTCATGTTGTTTGAGCGATATCTCGGGTCCTGTTGCTACGGGCTGCTTAGCAACCGGATTCTGCC
TACCCCCGCGACCTTCTCGTGGACTGATCAATCTGGAAAAGCTTTT
This flat file is composed of consecutive FASTA sequences (like the one above) and approaches 9 GB in size. What I am trying to do is slurp the entire file into an array (crazy, but my Unix system can handle it :-) and then parse the individual sequences out into sub-arrays. Because the sequences vary in length, I can't use @subarray=splice(@list, 0, 11); to pull out each sequence; I have to separate the sequences based on the > symbol. What would be the simplest way to say "slurp in the multiple lines of data between one > and the next and place them into subarray X"? My main worry is coding this so that Perl keeps its place along the way, and I don't get the same sequence pulled out 20,000 times rather than 20,000 different sequence arrays. As well, any suggestions for making this as painless a memory hog as possible would be greatly appreciated. I apologize if this seems a dumb question, but I'm a self-taught Perl hacker, and still pretty new, though playing with some nice Unix toys....
Bioinformatics

Replies are listed 'Best First'.
Re: Reformatting a large fasta file...
by Itatsumaki (Friar) on Nov 19, 2003 at 00:29 UTC

    I wouldn't do it this way if I were you! :)

    From the size, it looks like you're thinking of parsing nt, which is going to be slow no matter what. The sooner you start getting used to BioPerl, the better off you'll be. With Bio::SeqIO this is a handful of lines. Further, the tireless developers there are always looking for ways to speed up their parsers. It's a real friendly mailing list, too (bioperl-l@bioperl.org).

    use Bio::SeqIO;

    my $seqIO = Bio::SeqIO->new( -file => 'nt', -format => 'fasta' );
    my $i = 0;
    while ( my $seq = $seqIO->next_seq() ) {
        $i++;
    }
    print "There are $i non-redundant nucleotide sequences\n";

    The last thing any bioinformatician wants to do is rewrite parsers. There are dozens of sequence formats, and BioPerl lets you parse them all this way, just changing -format to what you need. In fact, I believe it even has some rudimentary format auto-detection. You can also write sequence objects back out with Bio::SeqIO, allowing for two- or three-line reformatters, as in the sketch below. It's an invaluable resource, and it isn't too hard to learn to use.
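    For example, a minimal FASTA-to-EMBL converter is just a second Bio::SeqIO handle (a sketch; the file names here are placeholders):

    use Bio::SeqIO;

    my $in  = Bio::SeqIO->new( -file => 'nt.fasta', -format => 'fasta' );
    my $out = Bio::SeqIO->new( -file => '>nt.embl', -format => 'embl' );

    # Copy every sequence object across; Bio::SeqIO handles the formatting.
    while ( my $seq = $in->next_seq() ) {
        $out->write_seq($seq);
    }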

    BioPerl 1.2.3 is available from CPAN, or you can get a CVS tarball from the BioPerl project site.

      This is a great suggestion. However, I have attempted to use the various BioPerl modules, but to no avail. Again, I'm still new to computer programming in general, and the lack of good documentation for the BioPerl modules has left me with little more than headaches :-). I'm working on using your suggestion currently, and I also intend to join the mailing list. Thanks!!
      Bioinformatics

        As I reread your initial post, I suspect you want to split nt into smaller chunks and write those to file. Here is a script I've been using to do that:

        ### USAGE
        #   perl -w splitter.pl <filename> <seq-per-file>
        #   e.g. perl -w splitter.pl nt 20000

        ### INCLUDES
        use strict;
        use Bio::SeqIO;

        ### LOCALS
        my $infile = $ARGV[0];
        my $size   = $ARGV[1];
        my $i      = 0;    # sequences written to the current output file
        my $j      = 1;    # output file counter

        ### OBJECT INSTANTIATION
        my $in = Bio::SeqIO->new(
            -file   => $infile,
            -format => 'Fasta',
        );
        my $out = Bio::SeqIO->new(
            -file   => ">${infile}_$j.fasta",
            -format => 'Fasta',
        );

        ### PROCESS FILES
        while ( my $seq = $in->next_seq() ) {
            # After $size sequences, roll over to a new output file
            if ( $i == $size ) {
                $j++;
                $out = Bio::SeqIO->new(
                    -file   => ">${infile}_$j.fasta",
                    -format => 'Fasta',
                );
                $i = 0;
            }
            $out->write_seq($seq);
            $i++;
        }
Re: Reformatting a large fasta file...
by Roger (Parson) on Nov 18, 2003 at 23:28 UTC
    Yes, there is a trick: you can localize the '$/' variable, set it to '>', and loop through your data. I haven't seen your actual data, but I made a few assumptions and came up with the code below, which I believe is memory-efficient.
    use strict;
    use Data::Dumper;

    my @seq;    # array of sequences
    my %seq;    # hash lookup of sequences, keyed on id

    {
        local $/ = '>';
        while (<DATA>) {
            s/^>//g;                          # strip '>' from the beginning
            s/>$//g;                          # and from the end of the record
            next if !length($_);              # ignore the empty first record

            my ($header_info) = /^(.*)\n/;    # capture the header line
            s/^(.*)\n//;                      # and strip it out

            my @rec = split /\|/, $header_info;
            s/\n//mg;                         # join the sequence strings
            push @rec, $_;                    # make the sequence the last element

            push @seq, \@rec;                 # push into the array of sequences
            $seq{$rec[1]} = \@rec;            # or store in a hash lookup
        }
    }

    print Dumper(\@seq);
    print Dumper(\%seq);

    __DATA__
    >gi|2695845|emb|Y13255.1|ABY13255 Acipenser baeri mRNA for immunoglobulin heavy chain, clone ScH 3.3
    TGGTTACAACACTTTCTTCTTTCAATAACCACAATACTGCAGTACAATGGGGATTTTAACAGCTCTCTGTATAATAATGA
    TACCCCCGCGACCTTCTCGTGGACTGATCAATCTGGAAAAGCTTTT
    >gi|2695846|emb|Y13255.1|ABY13255 Acipenser baeri mRNA for immunoglobulin heavy chain, clone ScH 3.3
    TGGTTACAACACTTTCTTCTTTCAATAACCACAATACTGCAGTACAATGGGGATTTTAACAGCTCTCTGTATAATAATGA
    TACCCCCGCGACCTTCTCGTGGACTGATCAATCTGGAAAAGCTTTT
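    One caveat for a 9 GB input: the read above streams one record at a time, but pushing every record into @seq and %seq still ends up holding the whole file in memory. A sketch of a fully streaming variant, reading file names from the command line (the printf is just a stand-in for whatever per-record processing is actually needed):

    use strict;
    use warnings;

    local $/ = '>';                      # read one FASTA record per iteration
    while ( my $rec = <> ) {
        chomp $rec;                      # chomp removes the trailing '>' (the current $/)
        next unless length $rec;         # skip the empty first record
        my ( $header, @lines ) = split /\n/, $rec;
        my $sequence = join '', @lines;
        my ($gi)     = ( split /\|/, $header )[1];
        printf "%s: %d bp\n", $gi, length $sequence;    # process, don't accumulate
    }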
Re: Reformatting a large fasta file...
by duff (Parson) on Nov 18, 2003 at 23:12 UTC
    Change your input record separator to >:
    {
        local $/ = ">";
        while (<FILE>) {
            ...
        }
    }
      When you change the record separator to ">", the first record read will be empty. (One of the later responses handles that case.) More importantly, the last chunk of data lies beyond the last record separator, and there is no separator at the end of file. Will that last chunk be picked up at all? (It will: at EOF, Perl's readline returns whatever remains as a final record, even though it has no trailing separator.)
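      A minimal sketch of the pattern with both edge cases covered (FILE is assumed to be an already-open handle on the fasta file):

      my $count = 0;
      {
          local $/ = '>';
          while ( my $rec = <FILE> ) {
              chomp $rec;                  # removes the trailing '>' (the current $/), when present
              next unless length $rec;     # skips the empty first record
              $count++;                    # the final record is returned at EOF even without a trailing '>'
          }
      }
      print "$count records\n";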
Re: Reformatting a large fasta file...
by BrowserUk (Patriarch) on Nov 18, 2003 at 23:39 UTC

    See perlvar $INPUT_RECORD_SEPARATOR.

    #! perl -slw
    use strict;
    use Data::Dumper;

    local $/ = '>';                     # Set record delimiter

    my @sequences;                      # AoA

    (undef) = scalar <DATA>;            # Discard first (blank) record.

    while ( my $record = <DATA> ) {     # Read each record
        # Split into lines
        my @lines = split "\n", $record;

        # Discard the last line (the trailing record delimiter) if necessary
        pop @lines if $lines[-1] eq '>';

        # Build the array: header fields, then the joined sequence
        push @sequences, [ split( '\|', shift @lines ), join( '', @lines ) ];
    }

    print Dumper \@sequences;

    __DATA__
    >gi|2695846|emb|Y13255.1|ABY13255 Acipenser baeri mRNA for immunoglobulin heavy chain, clone ScH 3.3
    TGGTTACAACACTTTCTTCTTTCAATAACCACAATACTGCAGTACAATGGGGATTTTAACAGCTCTCTGTATAATAATGA
    CAGCTCTATCAAGTGTCCGGTCTGATGTAGTGTTGACTGAGTCCGGACCAGCAGTTATAAAGCCTGGAGAGTCCCATAAA
    CTGTCCTGTAAAGCCTCTGGATTCACATTCAGCAGCGCCTACATGAGCTGGGTTCGACAAGCTCCTGGAAAGGGTCTGGA
    ATGGGTGGCTTATATTTACTCAGGTGGTAGTAGTACATACTATGCCCAGTCTGTCCAGGGAAGATTCGCCATCTCCAGAG
    ACGATTCCAACAGCATGCTGTATTTACAAATGAACAGCCTGAAGACTGAAGACACTGCCGTGTATTACTGTGCTCGGGGC
    GGGCTGGGGTGGTCCCTTGACTACTGGGGGAAAGGCACAATGATCACCGTAACTTCTGCTACGCCATCACCACCGACAGT
    GTTTCCGCTTATGGAGTCATGTTGTTTGAGCGATATCTCGGGTCCTGTTGCTACGGGCTGCTTAGCAACCGGATTCTGCC
    TACCCCCGCGACCTTCTCGTGGACTGATCAATCTGGAAAAGCTTTT
    >gi|2695846|emb|Y13255.1|ABY13255 Acipenser baeri mRNA for immunoglobulin heavy chain, clone ScH 3.3
    TGGTTACAACACTTTCTTCTTTCAATAACCACAATACTGCAGTACAATGGGGATTTTAACAGCTCTCTGTATAATAATGA
    CAGCTCTATCAAGTGTCCGGTCTGATGTAGTGTTGACTGAGTCCGGACCAGCAGTTATAAAGCCTGGAGAGTCCCATAAA
    CTGTCCTGTAAAGCCTCTGGATTCACATTCAGCAGCGCCTACATGAGCTGGGTTCGACAAGCTCCTGGAAAGGGTCTGGA
    ATGGGTGGCTTATATTTACTCAGGTGGTAGTAGTACATACTATGCCCAGTCTGTCCAGGGAAGATTCGCCATCTCCAGAG
    ACGATTCCAACAGCATGCTGTATTTACAAATGAACAGCCTGAAGACTGAAGACACTGCCGTGTATTACTGTGCTCGGGGC
    GGGCTGGGGTGGTCCCTTGACTACTGGGGGAAAGGCACAATGATCACCGTAACTTCTGCTACGCCATCACCACCGACAGT
    GTTTCCGCTTATGGAGTCATGTTGTTTGAGCGATATCTCGGGTCCTGTTGCTACGGGCTGCTTAGCAACCGGATTCTGCC
    TACCCCCGCGACCTTCTCGTGGACTGATCAATCTGGAAAAGCTTTT
    >gi|2695846|emb|Y13255.1|ABY13255 Acipenser baeri mRNA for immunoglobulin heavy chain, clone ScH 3.3
    TGGTTACAACACTTTCTTCTTTCAATAACCACAATACTGCAGTACAATGGGGATTTTAACAGCTCTCTGTATAATAATGA
    CAGCTCTATCAAGTGTCCGGTCTGATGTAGTGTTGACTGAGTCCGGACCAGCAGTTATAAAGCCTGGAGAGTCCCATAAA
    CTGTCCTGTAAAGCCTCTGGATTCACATTCAGCAGCGCCTACATGAGCTGGGTTCGACAAGCTCCTGGAAAGGGTCTGGA
    ATGGGTGGCTTATATTTACTCAGGTGGTAGTAGTACATACTATGCCCAGTCTGTCCAGGGAAGATTCGCCATCTCCAGAG
    ACGATTCCAACAGCATGCTGTATTTACAAATGAACAGCCTGAAGACTGAAGACACTGCCGTGTATTACTGTGCTCGGGGC
    GGGCTGGGGTGGTCCCTTGACTACTGGGGGAAAGGCACAATGATCACCGTAACTTCTGCTACGCCATCACCACCGACAGT
    GTTTCCGCTTATGGAGTCATGTTGTTTGAGCGATATCTCGGGTCCTGTTGCTACGGGCTGCTTAGCAACCGGATTCTGCC
    TACCCCCGCGACCTTCTCGTGGACTGATCAATCTGGAAAAGCTTTT
    >gi|2695846|emb|Y13255.1|ABY13255 Acipenser baeri mRNA for immunoglobulin heavy chain, clone ScH 3.3
    TGGTTACAACACTTTCTTCTTTCAATAACCACAATACTGCAGTACAATGGGGATTTTAACAGCTCTCTGTATAATAATGA
    CAGCTCTATCAAGTGTCCGGTCTGATGTAGTGTTGACTGAGTCCGGACCAGCAGTTATAAAGCCTGGAGAGTCCCATAAA
    CTGTCCTGTAAAGCCTCTGGATTCACATTCAGCAGCGCCTACATGAGCTGGGTTCGACAAGCTCCTGGAAAGGGTCTGGA
    ATGGGTGGCTTATATTTACTCAGGTGGTAGTAGTACATACTATGCCCAGTCTGTCCAGGGAAGATTCGCCATCTCCAGAG
    ACGATTCCAACAGCATGCTGTATTTACAAATGAACAGCCTGAAGACTGAAGACACTGCCGTGTATTACTGTGCTCGGGGC
    GGGCTGGGGTGGTCCCTTGACTACTGGGGGAAAGGCACAATGATCACCGTAACTTCTGCTACGCCATCACCACCGACAGT
    GTTTCCGCTTATGGAGTCATGTTGTTTGAGCGATATCTCGGGTCCTGTTGCTACGGGCTGCTTAGCAACCGGATTCTGCC
    TACCCCCGCGACCTTCTCGTGGACTGATCAATCTGGAAAAGCTTTT

    Output
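    Abridged, the dump looks like this (the sequence string is truncated here for display; the real run prints it in full):

    $VAR1 = [
              [
                'gi',
                '2695846',
                'emb',
                'Y13255.1',
                'ABY13255 Acipenser baeri mRNA for immunoglobulin heavy chain, clone ScH 3.3',
                'TGGTTACAACACTTTCTTCTTTCAATAACCACAATACTGCAG...GATCAATCTGGAAAAGCTTTT'
              ],
              # ...three more identical entries for the remaining DATA records
            ];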


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail

Re: Reformatting a large fasta file...
by Hena (Friar) on Nov 19, 2003 at 10:21 UTC
    Although this is not a Perl answer, it is worth pointing out when you are doing bioinformatics on *nix.

    EMBOSS has a lot of commands for handling sequences (e.g. dreg or preg for pattern searching, and the fuzz* programs). Also, for example, BLAST and InterProScan can be downloaded and run locally. And as Itatsumaki said, BioPerl can help out a lot.
