How do I extract ORFs from a fasta file into a new fasta file

Wasp_Guy has asked for the wisdom of the Perl Monks concerning the following question:

Hi Pearl Monks. In short: I am a Biochemistry Ph.D. student working on a proteomics project. I am new to programming and Perl. I have an assembled RNA-seq fasta file and I need to extract all of the ORFs into a new fasta file that I can use to blast proteome data against. Any advice on how I can proceed would be very appreciated.

The long story: The RNA-seq data I have is a de-novo assembly of illumina data without a reference genome. This file is far to long to open it on my computer let alone go through it by hand. I also have some mass spec data that returned tryptic peptide sequences from the same tissue. I would like to pull out all of the full length CDS with some 5' and 3' UTR info for the proteins I have found in the tissue by mass spec. I have tried simply blasting the peptides against the assembly but I get hits that are not in open reading frames. It is my hope that if I have a database of only ORFs that identification of my peptides transcripts would be easier. I have read Borisz answer to a similar question in node id=473744 back in 2005. I have copied his code below. I believe this is a good place to start, but I would suppose the major difference is that I need to return all the ORFs from tens of thousands of entries as a new fasta file. Thank you for your consideration.

local $_  = $your_input_string;
while ( /ATG/g ) {
   my $start = pos() - 3;
   if ( /T(?:AA|AG|GA)/g ) {
     my $stop = pos;
     print $start, " ", $stop, " ", $stop - $start, " ", 
       substr ($_, $start, $stop - $start), $/;     
   }
}
[download]

Comment on How do I extract ORFs from a fasta file into a new fasta file Download Code

Replies are listed 'Best First'.
Re: How do I extract ORFs from a fasta file into a new fasta file by Anonymous Monk on Jul 07, 2013 at 04:02 UTC
You seem to have taken the formatting advice to heart ;) so I'd like to point you to read RFC: Bioinformatics Tutorial, Perl and Bioinformatics, and understand that most folks aren't studying bioinformatics, and aren't working on a PhD in bioinformatics, and aren't familiar with bioinformatic jargon I think Re: extracting open reading frame (ORF) from a FASTA file will already print all cough ORFs, so I don't know what you're wanting to do , that the code you found doesn't do already	[reply]
Re: How do I extract ORFs from a fasta file into a new fasta file by zork42 (Monk) on Jul 08, 2013 at 01:40 UTC
(Copied from my post in this thread: how do i obtain blast result from the given file.) If you're looking for some good books, I can highly recommend these 3 books which form a 3 part series: Learning Perl: Making Easy Things Easy and Hard Things Possible Intermediate Perl: Beyond The Basics of Learning Perl Mastering Perl: Creating Professional Programs with Perl I just happened to notice this. I have no idea what it it like, but it might be just what you want: Mastering Perl for Bioinformatics	[reply]
Re: How do I extract ORFs from a fasta file into a new fasta file by biohisham (Priest) on Jul 08, 2013 at 08:29 UTC
"I would like to pull out all of the full length CDS with some 5' and 3' UTR info for the proteins I have found in the tissue by mass spec" If I understand you correctly, I will look at the gff/gtf coordinates for the CDs feature and try to obtain that for the specific protein annotation in the gff file, A fastA file may not have these features handy so it maybe not the way to go unless you have an annotated genome somewhere for you to extract this stuff from. intro to GFF format GFF Object modules extension of GFF object modules "I have tried simply blasting the peptides against the assembly but I get hits that are not in open reading frames" Maybe because BLAST is a local alignment and search tool, that behavior is totally expected. " I would suppose the major difference is that I need to return all the ORFs from tens of thousands of entries as a new fasta file." Read the file one line at a time. Maybe something like what follows ?... If you don't have BioPerl installed then certainly there are many other examples on reading fastA file that don't use BioPerl around here `#UNTESTED CODE TO DEMONSTRATE A WAY AROUND use strict; use warnings; use Bio::SeqIO; my $in = Bio::SeqIO->new(-file =>"FastA.fa", format = "FASTA"); while(my $seq = $in->next_seq){ my $sequence = $seq->seq; #Do something with $sequence my $ORFpattern = "foo"; if($sequence =~ /$ORFpattern/){ #report or anything } }` [download] "Hi Pearl Monks" we are Perl Monks and not Pearl Monks... Welcome to the Monastery and good luck.... David R. Gergen said "We know that second terms have historically been marred by hubris and by scandal." and I am a two y.o. monk today :D, June,12th, 2011...	[reply] [d/l] [select]
Re: How do I extract ORFs from a fasta file into a new fasta file by Anonymous Monk on Jul 07, 2013 at 16:41 UTC
Forgive me for replying to a question i know nothing about. But you say you are new to programming, and perl, and you found this code somewhere on the internet. I doubt very much that that would be a good place to start. Best regards.	[reply]

Back to Seekers of Perl Wisdom