Re: please help me!!

in reply to please help me parse genbank DNA file

Have you looked into BioPerl? It will simplify parsing for you (especially for the sequence itself). Here's a program that gets the sequence and some other basic information:

#!/usr/local/bin/perl -w
use strict;

use Bio::SeqIO;

my $seqobj;

print "please type in the name of a file\n";
chomp(my $file = <STDIN>);  # strip the trailing newline, or the file won't be found

my $seqio  = Bio::SeqIO->new (-format => 'GenBank',
                              -file =>   $file);
while ($seqobj = $seqio->next_seq())
{
  printf "Sequence: %s\n",$seqobj->seq;

  # I'm not sure what you need other than the
  # sequence - here are some examples:
  printf "Display ID:  %s\n",$seqobj->display_id;
  printf "Description: %s\n",$seqobj->desc;
  printf "Division:    %s\n",$seqobj->division;
  printf "Accession:   %s\n",$seqobj->accession_number;
}

In your program, you're putting all of the non-sequence lines into @annotation. I'm not sure specifically which information you need (e.g. description, accession number, etc.), but it is all accessible through the $seqobj object. There are some examples in the code above; you'll find many more in the documentation.
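
If what you were collecting in @annotation is really the feature table (genes, CDS, and so on), here is a rough sketch of how that might look with the same objects. I'm guessing at what you need: feature_lines() is just a name I made up for this example (it is not part of BioPerl), and the file name in the usage comment is only a placeholder.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Bio::SeqIO;

# Return one tab-separated summary line per feature, for every record
# available from the given Bio::SeqIO stream.
# NOTE: feature_lines() is a hypothetical helper, not a BioPerl function.
sub feature_lines {
    my ($seqio) = @_;
    my @lines;
    while (my $seqobj = $seqio->next_seq) {
        for my $feat ($seqobj->get_SeqFeatures) {
            # Not every feature carries a /gene qualifier, so check first.
            my $gene = $feat->has_tag('gene')
                     ? join(',', $feat->get_tag_values('gene'))
                     : '-';
            push @lines, sprintf("%s\t%s\t%d..%d\t%s",
                $seqobj->display_id, $feat->primary_tag,
                $feat->start, $feat->end, $gene);
        }
    }
    return @lines;
}

# Usage (the file name is only an example): perl features.pl myfile.gb
if (@ARGV) {
    my $seqio = Bio::SeqIO->new(-format => 'genbank', -file => $ARGV[0]);
    print "$_\n" for feature_lines($seqio);
}
```

Each output line holds the record's display ID, the feature type, its start..end coordinates, and the gene name if there is one. Other qualifiers (product, note, etc.) can be read the same way with has_tag and get_tag_values.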

This method also has the advantage of being able to handle multiple GenBank records per file.

This is just a small part of what BioPerl offers - it will also parse BLAST reports, perform alignments, etc. If you're interested, you can grab the latest release from CPAN or from the BioPerl website. Hope this helps!

- robsv
