Hi Perl Monks. In short: I am a Biochemistry Ph.D. student working on a proteomics project, and I am new to both programming and Perl. I have an assembled RNA-seq FASTA file and I need to extract all of the ORFs into a new FASTA file that I can BLAST my proteomics data against. Any advice on how to proceed would be much appreciated.

The long story: The RNA-seq data I have is a de novo assembly of Illumina reads, built without a reference genome. The file is far too large to open on my computer, let alone go through by hand. I also have mass spec data from the same tissue that returned tryptic peptide sequences. For the proteins I have found in the tissue by mass spec, I would like to pull out the full-length CDS along with some 5' and 3' UTR sequence. I have tried simply BLASTing the peptides against the assembly, but I get hits that are not in open reading frames. My hope is that a database containing only ORFs would make it easier to identify the transcripts my peptides come from. I have read Borisz's answer to a similar question in node id=473744, from 2005, and have copied his code below. I believe it is a good place to start, but I suppose the major difference is that I need to return all the ORFs from tens of thousands of entries as a new FASTA file. Thank you for your consideration.

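# Search one sequence string for ATG ... stop-codon spans.
local $_ = $your_input_string;
while ( /ATG/g ) {
   my $start = pos() - 3;              # 0-based offset of the ATG
   # Note: this matches the next TAA/TAG/TGA anywhere downstream;
   # it is not restricted to the reading frame of the ATG.
   if ( /T(?:AA|AG|GA)/g ) {
     my $stop = pos;                   # offset just past the stop codon
     # Print start, stop, length, and the sequence itself.
     print $start, " ", $stop, " ", $stop - $start, " ",
       substr ($_, $start, $stop - $start), $/;
   }
}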
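
For what it is worth, here is a minimal sketch of how the snippet above might be extended to a whole FASTA file. It is not Borisz's code and makes several assumptions of its own: an ORF is taken to run from the first ATG after the previous stop to the next stop codon in the same reading frame, all six frames (both strands) are scanned, ORFs that run off the end of a contig without a stop are dropped, and ORFs shorter than a cutoff (300 nt by default, an arbitrary choice) are skipped. The names `revcomp`, `find_orfs` and `extract_orfs` are made up for this example. The output is nucleotide sequence, so it would still need translating before a protein search.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reverse-complement a DNA string (N and other ambiguity codes are left unchanged).
sub revcomp {
    my ($seq) = @_;
    my $rc = reverse $seq;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# Return [frame, start, end, sequence] for each in-frame ATG..stop ORF on one
# strand. start and end are 1-based coordinates on the strand passed in.
sub find_orfs {
    my ($seq, $min_len) = @_;
    $seq = uc $seq;
    my @orfs;
    for my $frame (0 .. 2) {
        my $orf_start;    # undef while we are not inside an ORF
        for (my $i = $frame; $i + 3 <= length $seq; $i += 3) {
            my $codon = substr $seq, $i, 3;
            if (!defined $orf_start) {
                $orf_start = $i if $codon eq 'ATG';
            }
            elsif ($codon =~ /^(?:TAA|TAG|TGA)$/) {
                my $len = $i + 3 - $orf_start;    # length includes the stop codon
                push @orfs, [$frame + 1, $orf_start + 1, $i + 3,
                             substr($seq, $orf_start, $len)]
                    if $len >= $min_len;
                undef $orf_start;
            }
        }
    }
    return @orfs;
}

# Read a FASTA file one record at a time, so the whole file is never held
# in memory, and write every ORF found to a new FASTA file.
sub extract_orfs {
    my ($in_path, $out_path, $min_len) = @_;
    open my $in,  '<', $in_path  or die "Cannot read $in_path: $!";
    open my $out, '>', $out_path or die "Cannot write $out_path: $!";
    local $/ = "\n>";                  # each read returns one FASTA record
    while (my $record = <$in>) {
        chomp $record;                 # drop the trailing "\n>"
        $record =~ s/^>//;             # the first record still has its '>'
        my ($header, @lines) = split /\n/, $record;
        next unless defined $header;
        my ($id) = split ' ', $header; # sequence ID = first word of the header
        next unless defined $id;
        my $seq = join '', @lines;
        $seq =~ s/\s+//g;
        my %strand = ('+' => $seq, '-' => revcomp($seq));
        for my $s ('+', '-') {
            for my $orf (find_orfs($strand{$s}, $min_len)) {
                my ($frame, $start, $end, $dna) = @$orf;
                print {$out} ">${id}_${s}${frame}_${start}-${end}\n$dna\n";
            }
        }
    }
    close $out or die "Cannot close $out_path: $!";
}

# Usage: perl orfs.pl assembly.fasta orfs.fasta [min_length_nt]
extract_orfs(@ARGV[0, 1], $ARGV[2] // 300) if @ARGV >= 2;
```

Each output header carries the contig ID, strand, frame and coordinates, so a hit can be traced back to the original transcript to recover the flanking UTR sequence. Coordinates for minus-strand ORFs are relative to the reverse-complemented contig.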

In reply to How do I extract ORFs from a fasta file into a new fasta file by Wasp_Guy
