I see some trouble with your code to parse out the sequences from SEQRES lines. There is some redundant information in the .pdb format. I used the sequential letter (A,B,C,D,E,etc) to delineate start/end of the various sequences. Number of 3 letter groups in the sequence would be another way. I am not sure what you are trying to do. At the end of code, see a couple of example FASTA segments I cut n' pasted from the program output.
I have no ide what to do with the x,y,z atom coordinates or even what they mean. I averaged them for fun. I changed your regex a bit to make it easier to pick out the +-floating point numbers.
To run code, download the .pdb file and name it "6U9D.pdb". In production code, I probably would not use so much memory for temporary results, preferring to calculate and output results as they occur rather than save a bunch of data and process it at the end. However for these prototype things, having the data in intermediate forms can useful for debugging. Have fun.
use strict;
use warnings;
open my $pdb_fh, '<', "6U9D.pdb" or die "unable to open 6U9D.pdb for r
+eading $!";
my %amino_acid_conversion = (
ALA => 'A',
ARG => 'R',
ASN => 'N',
ASP => 'D',
CYS => 'C',
GLN => 'Q',
GLU => 'E',
GLY => 'G',
HIS => 'H',
ILE => 'I',
LEU => 'L',
LYS => 'K',
MET => 'M',
PHE => 'F',
PRO => 'P',
SER => 'S',
THR => 'T',
TRP => 'W',
TYR => 'Y',
VAL => 'V');
my @atoms; #2D array of x,y,z of all atoms (many thousands)
my @seqs; #raw sequences with the 3 letter designations
#convert later to FASTA format
my $cur_seq = '';
my $cur_ltr = '';
# parse out data of interest from file
#
while (<$pdb_fh>)
{
if (my ($ltr, $seq) = (/^SEQRES\s+\d+\s+(\w+)\s+\d+\s+([A-Z ]+)$/)
+)
{
$seq =~ s/\s+$//;
if ($ltr eq $cur_ltr)
{
$cur_seq .= " $seq";
}
else
{
push @seqs, $cur_seq if $cur_seq ne ''; # end of current seq
+uence
$cur_seq = $seq; # begin of the next
+sequence
$cur_ltr = $ltr;
}
}
elsif (my ($x,$y,$z) = (/^ATOM\s+.*?([\d.-]+)\s+([\d.-]+)\s+([\d.-]
++)/) )
{
push @atoms, [$x,$y,$z];
}
}
push @seqs, $cur_seq; # don't forget to finish the last seq!
#### output collected data ###
#make a fasta sequence segments
foreach my $seq (@seqs)
{
# my $fasta = join '',map{$amino_acid_conversion{$_}}split ' ',$seq
+;
# without using a map:
#
my $fasta ='';
foreach my $char3 (split ' ',$seq)
{
$fasta.= $amino_acid_conversion{$char3}
}
print ">Some Fasta Description Line\n"; #use 60 char lines
while ($fasta) #fasta suggested max is 80
{
print substr($fasta,0,60,''),"\n";
}
}
#print the data points
# I am not sure what needs to be done with them
# average of each coordinate?
#foreach my $row_ref (@atoms) #uncomment to print
#{
# print @$row_ref,"\n";
#}
my $xsum; my $ysum; my $zsum;
foreach my $row_ref (@atoms) # @atoms is a 2D array
{
my ($x, $y , $z ) = @$row_ref;
$xsum+=$x;
$ysum+=$y;
$zsum+=$z;
}
print "avg x = ",$xsum/@atoms,"\n";
print "avg y = ",$ysum/@atoms,"\n";
print "avg z = ",$zsum/@atoms,"\n";
__END__
These are 2 examples:
You will have to figure out what goes in the FASTA description line
And perhaps not all of these sequences are relevant? Looks like
a lot are duplicates.
>Some Fasta Description Line
MHHHHHHENLYFQGAPSFNVDPLEQPAEPSKLAKKLRAEPDMDTSFVGLTGGQIFNEMMS
RQNVDTVFGYPGGAILPVYDAIHNSDKFNFVLPKHEQGAGHMAEGYARASGKPGVVLVTS
GPGATNVVTPMADAFADGIPMVVFTGQVPTSAIGTDAFQEADVVGISRSCTKWNVMVKSV
EELPLRINEAFEIATSGRPGPVLVDLPKDVTAAILRNPIPTKTTLPSNALNQLTSRAQDE
FVMQSINKAADLINLAKKPVLYVGAGILNHADGPRLLKELSDRAQIPVTTTLQGLGSFDQ
EDPKSLDMLGMHGCATANLAVQNADLIIAVGARFDDRVTGNISKFAPEARRAAAEGRGGI
IHFEVSPKNINKVVQTQIAVEGDATTNLGKMMSKIFPVKERSEWFAQINKWKKEYPYAYM
EETPGSKIKPQTVIKKLSKVANDTGRHVIVTTGVGQHQMWAAQHWTWRNPHTFITSGGLG
TMGYGLPAAIGAQVAKPESLVIDIDGDASFNMTLTELSSAVQAGTPVKILILNNEEQGMV
TQWQSLFYEHRYSHTHQLNPDFIKLAEAMGLKGLRVKKQEELDAKLKEFVSTKGPVLLEV
EVDKKVPVLPMVAGGSGLDEFINFDPEVERQQTELRHKRTGGKH
>Some Fasta Description Line
MGSSHHHHHHSSGLVPRGSHMENLYFQGATRPPLPTLDTPSWNANSAVSSIIYETPAPSR
QPRKQHVLNCLVQNEPGVLSRVSGTLAARGFNIDSLVVCNTEVKDLSRMTIVLQGQDGVI
EQARRQIEDLVPVYAVLDYTNSEIIKRELVMARISLLGTEYFEDLLLHHHTSTNAGAADS
QELVAEIREKQFHPANLPASEVLRLKHEHLNDITNLTNNFGGRVVDISETSCIVELSAKP
TRISAFLKLVEPFGVLECARSGMMALPRTPLKTSTEEAADEDEKISEIVDISQLPPG
I have no idea what these numbers would mean?
avg x = 321.013155298296
avg y = 290.744642162734
avg z = 69.196842162731
-
Are you posting in the right place? Check out Where do I post X? to know for sure.
-
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big>
<blockquote> <br /> <dd>
<dl> <dt> <em> <font>
<h1> <h2> <h3> <h4>
<h5> <h6> <hr /> <i>
<li> <nbsp> <ol> <p>
<small> <strike> <strong>
<sub> <sup> <table>
<td> <th> <tr> <tt>
<u> <ul>
-
Snippets of code should be wrapped in
<code> tags not
<pre> tags. In fact, <pre>
tags should generally be avoided. If they must
be used, extreme care should be
taken to ensure that their contents do not
have long lines (<70 chars), in order to prevent
horizontal scrolling (and possible janitor
intervention).
-
Want more info? How to link
or How to display code and escape characters
are good places to start.