You are missing a \s+ between a \w+ and \d+. Also, the whole part needs to be moved inside the loop.
#! /usr/bin/perl
use warnings;
use strict;
open my $in, '<', '6U9D.pdb' or die $!;
my %amino_acid_conversion = (ALA => 'A', TYR => 'Y', MET => 'M', LEU =
+> 'L',
CYS => 'C', GLY => 'G', ARG => 'R', ASN =
+> 'N',
ASP => 'D', GLN => 'Q', GLU => 'E', HIS =
+> 'H',
TRP => 'W', LYS => 'K', PHE => 'F', PRO =
+> 'P',
SER => 'S', THR => 'T', ILE => 'I', VAL =
+> 'V');
my ($x, $y, $z) = (0, 0, 0);
my $seq;
while (<$in>) {
if (/HEADER\s+(.*)/) {
print ">$1\n";
}
if (/^SEQRES\s+\d+\s+\w+\s+\d+\s+(.*)/) {
$seq .= $1;
}
if (/^ATOM\s+\d+\s+\w+\s+\w+\s+\w+\s+\d+\s+(\S+)\s+(\S+)\s+(\S+)/)
+ {
# ~~~
$x += $1;
$y += $2;
$z += $3;
}
}
$seq =~ s/ //g;
my $k = 0;
for (my $i = 0; $i < length $seq; $i += 3) {
my $triplet = substr $seq, $i, 3;
if (exists $amino_acid_conversion{$triplet}) {
print "$amino_acid_conversion{$triplet}";
} else {
warn "Don't know how to convert '$triplet'.\n";
}
$k++;
}
print "\n";
my $xgk = $x / $k;
my $ygk = $y / $k;
my $zgk = $z / $k;
print "$xgk $ygk $zgk \n";
Note that I also turned strict and warnings on, used the 3-argument version of open without bareword filehandles, and checked its return value.
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
| [reply] [d/l] [select] |
if ($_=~m/^ATOM\s+\d+\s+\w+\s+\w+\s+\w+\d+\s+(\S+)\s+(\S+)\s+(\S+)/){
Here you match against $_, but it was never set. Did you want to match against $seq? I'm not sure.
Some general stylistic things:
open (IN, '6U9D.pdb.txt');
You should check whether opening the file actually succeeded:
my $filename = '6U9D.pdb.txt';
open (IN, $filename)
or die "Couldn't open '$filename': $!";
You should use the strict pragma. This requires you to declare all variables but on the upside, it allows Perl to tell you about typos in your variable names.
When assigning the %amino_acid_conversion, you could use newlines to make the code a bit more readable:
%amino_acid_conversion = (
ALA=>'A',
TYR=>'Y',
MET=>'M',
LEU=>'L',
CYS=>'C',
GLY=>'G',
ARG=>'R',
ASN=>'N',
ASP=>'D',
GLN=>'Q',
GLU=>'E',
HIS=>'H',
TRP=>'W',
LYS=>'K',
PHE=>'F',
PRO=>'P',
SER=>'S',
THR=>'T',
ILE=>'I',
VAL=>'V'
);
| [reply] [d/l] [select] |
%amino_acid_conversion = (
ALA => 'A',
ARG => 'R',
ASN => 'N',
ASP => 'D',
CYS => 'C',
GLN => 'Q',
GLU => 'E',
GLY => 'G',
HIS => 'H',
ILE => 'I',
LEU => 'L',
LYS => 'K',
MET => 'M',
PHE => 'F',
PRO => 'P',
SER => 'S',
THR => 'T',
TRP => 'W',
TYR => 'Y',
VAL => 'V'
);
| [reply] [d/l] [select] |
use YAML::XS qw( Load );
my %conversions = %{Load(<<'EOT')};
---
ALA: A
TYR: Y
MET: M
...
EOT
The cake is a lie.
The cake is a lie.
The cake is a lie.
| [reply] [d/l] |
I see some trouble with your code to parse out the sequences from SEQRES lines. There is some redundant information in the .pdb format. I used the sequential letter (A,B,C,D,E,etc) to delineate start/end of the various sequences. Number of 3 letter groups in the sequence would be another way. I am not sure what you are trying to do. At the end of code, see a couple of example FASTA segments I cut n' pasted from the program output.
I have no ide what to do with the x,y,z atom coordinates or even what they mean. I averaged them for fun. I changed your regex a bit to make it easier to pick out the +-floating point numbers.
To run code, download the .pdb file and name it "6U9D.pdb". In production code, I probably would not use so much memory for temporary results, preferring to calculate and output results as they occur rather than save a bunch of data and process it at the end. However for these prototype things, having the data in intermediate forms can useful for debugging. Have fun.
use strict;
use warnings;
open my $pdb_fh, '<', "6U9D.pdb" or die "unable to open 6U9D.pdb for r
+eading $!";
my %amino_acid_conversion = (
ALA => 'A',
ARG => 'R',
ASN => 'N',
ASP => 'D',
CYS => 'C',
GLN => 'Q',
GLU => 'E',
GLY => 'G',
HIS => 'H',
ILE => 'I',
LEU => 'L',
LYS => 'K',
MET => 'M',
PHE => 'F',
PRO => 'P',
SER => 'S',
THR => 'T',
TRP => 'W',
TYR => 'Y',
VAL => 'V');
my @atoms; #2D array of x,y,z of all atoms (many thousands)
my @seqs; #raw sequences with the 3 letter designations
#convert later to FASTA format
my $cur_seq = '';
my $cur_ltr = '';
# parse out data of interest from file
#
while (<$pdb_fh>)
{
if (my ($ltr, $seq) = (/^SEQRES\s+\d+\s+(\w+)\s+\d+\s+([A-Z ]+)$/)
+)
{
$seq =~ s/\s+$//;
if ($ltr eq $cur_ltr)
{
$cur_seq .= " $seq";
}
else
{
push @seqs, $cur_seq if $cur_seq ne ''; # end of current seq
+uence
$cur_seq = $seq; # begin of the next
+sequence
$cur_ltr = $ltr;
}
}
elsif (my ($x,$y,$z) = (/^ATOM\s+.*?([\d.-]+)\s+([\d.-]+)\s+([\d.-]
++)/) )
{
push @atoms, [$x,$y,$z];
}
}
push @seqs, $cur_seq; # don't forget to finish the last seq!
#### output collected data ###
#make a fasta sequence segments
foreach my $seq (@seqs)
{
# my $fasta = join '',map{$amino_acid_conversion{$_}}split ' ',$seq
+;
# without using a map:
#
my $fasta ='';
foreach my $char3 (split ' ',$seq)
{
$fasta.= $amino_acid_conversion{$char3}
}
print ">Some Fasta Description Line\n"; #use 60 char lines
while ($fasta) #fasta suggested max is 80
{
print substr($fasta,0,60,''),"\n";
}
}
#print the data points
# I am not sure what needs to be done with them
# average of each coordinate?
#foreach my $row_ref (@atoms) #uncomment to print
#{
# print @$row_ref,"\n";
#}
my $xsum; my $ysum; my $zsum;
foreach my $row_ref (@atoms) # @atoms is a 2D array
{
my ($x, $y , $z ) = @$row_ref;
$xsum+=$x;
$ysum+=$y;
$zsum+=$z;
}
print "avg x = ",$xsum/@atoms,"\n";
print "avg y = ",$ysum/@atoms,"\n";
print "avg z = ",$zsum/@atoms,"\n";
__END__
These are 2 examples:
You will have to figure out what goes in the FASTA description line
And perhaps not all of these sequences are relevant? Looks like
a lot are duplicates.
>Some Fasta Description Line
MHHHHHHENLYFQGAPSFNVDPLEQPAEPSKLAKKLRAEPDMDTSFVGLTGGQIFNEMMS
RQNVDTVFGYPGGAILPVYDAIHNSDKFNFVLPKHEQGAGHMAEGYARASGKPGVVLVTS
GPGATNVVTPMADAFADGIPMVVFTGQVPTSAIGTDAFQEADVVGISRSCTKWNVMVKSV
EELPLRINEAFEIATSGRPGPVLVDLPKDVTAAILRNPIPTKTTLPSNALNQLTSRAQDE
FVMQSINKAADLINLAKKPVLYVGAGILNHADGPRLLKELSDRAQIPVTTTLQGLGSFDQ
EDPKSLDMLGMHGCATANLAVQNADLIIAVGARFDDRVTGNISKFAPEARRAAAEGRGGI
IHFEVSPKNINKVVQTQIAVEGDATTNLGKMMSKIFPVKERSEWFAQINKWKKEYPYAYM
EETPGSKIKPQTVIKKLSKVANDTGRHVIVTTGVGQHQMWAAQHWTWRNPHTFITSGGLG
TMGYGLPAAIGAQVAKPESLVIDIDGDASFNMTLTELSSAVQAGTPVKILILNNEEQGMV
TQWQSLFYEHRYSHTHQLNPDFIKLAEAMGLKGLRVKKQEELDAKLKEFVSTKGPVLLEV
EVDKKVPVLPMVAGGSGLDEFINFDPEVERQQTELRHKRTGGKH
>Some Fasta Description Line
MGSSHHHHHHSSGLVPRGSHMENLYFQGATRPPLPTLDTPSWNANSAVSSIIYETPAPSR
QPRKQHVLNCLVQNEPGVLSRVSGTLAARGFNIDSLVVCNTEVKDLSRMTIVLQGQDGVI
EQARRQIEDLVPVYAVLDYTNSEIIKRELVMARISLLGTEYFEDLLLHHHTSTNAGAADS
QELVAEIREKQFHPANLPASEVLRLKHEHLNDITNLTNNFGGRVVDISETSCIVELSAKP
TRISAFLKLVEPFGVLECARSGMMALPRTPLKTSTEEAADEDEKISEIVDISQLPPG
I have no idea what these numbers would mean?
avg x = 321.013155298296
avg y = 290.744642162734
avg z = 69.196842162731
| [reply] [d/l] |