Nickmofoe has asked for the wisdom of the Perl Monks concerning the following question:

Greetings fellow monks. I have picked up perl for my bioinformatics class and im struggling with an assignment. So the gist is I need to write a program that reads a pdb flat file ( and I want to essentially turn it into a FASTA file as well as calculate the geometric center of the atoms using the xyz coordinates the file provides. Now the first part, I have no trouble with. This is the code I have tried so far:

$k=0; open (IN, '6U9D.pdb.txt'); %amino_acid_conversion = (ALA=>'A',TYR=>'Y',MET=>'M',LEU=>'L',CYS=>'C' +,GLY=>'G', ARG=>'R',ASN=>'N',ASP=>'D',GLN=>'Q',GLU=>'E',HIS=>'H',TRP= +>'W',LYS=>'K',PHE=>'F',PRO=>'P',SER=>'S',THR=>'T',ILE=>'I',VAL=>'V'); while (<IN>) { if ($_=~m/HEADER\s+(.*)/){ print ">$1\n"; } if ($_=~m/^SEQRES\s+\d+\s+\w+\s+\d+\s+(.*)/){ $seq.=$1; $seq=~s/ //g;} } #THIS IS THE PART I CANT SEEM TO FIGURE OUT for ($i=0;$i<=length $seq; $i+=3) { print "$amino_acid_conversion{substr($seq,$i,3)}"; #All seems well wit +h the sequence I get the desired output $k++;} print "\n"; if ($_=~m/^ATOM\s+\d+\s+\w+\s+\w+\s+\w+\d+\s+(\S+)\s+(\S+)\s+(\S+)/){ + #the parentheses are the xyz coordinates respectively $x+=$1; $y+=$2; $z+=$3; } $xgk=($x/$k); $ygk=($y/$k); $zgk=($z/$k); print "$xgk $ygk $zgk \n";

The total outpud I get is the sequence and three zeroes. It seems like its not even reading the coordinates? Not sure there... If any of you can share your wisdom I'd appreciate it! Oh and since this is an assignment please use the commands I have here, since this is what I have been taught so far.

Replies are listed 'Best First'.
Re: Issues regarding for loops and recursion
by choroba (Cardinal) on Jan 09, 2023 at 12:44 UTC
    You are missing a \s+ between a \w+ and \d+. Also, the whole part needs to be moved inside the loop.

    #! /usr/bin/perl use warnings; use strict; open my $in, '<', '6U9D.pdb' or die $!; my %amino_acid_conversion = (ALA => 'A', TYR => 'Y', MET => 'M', LEU = +> 'L', CYS => 'C', GLY => 'G', ARG => 'R', ASN = +> 'N', ASP => 'D', GLN => 'Q', GLU => 'E', HIS = +> 'H', TRP => 'W', LYS => 'K', PHE => 'F', PRO = +> 'P', SER => 'S', THR => 'T', ILE => 'I', VAL = +> 'V'); my ($x, $y, $z) = (0, 0, 0); my $seq; while (<$in>) { if (/HEADER\s+(.*)/) { print ">$1\n"; } if (/^SEQRES\s+\d+\s+\w+\s+\d+\s+(.*)/) { $seq .= $1; } if (/^ATOM\s+\d+\s+\w+\s+\w+\s+\w+\s+\d+\s+(\S+)\s+(\S+)\s+(\S+)/) + { # ~~~ $x += $1; $y += $2; $z += $3; } } $seq =~ s/ //g; my $k = 0; for (my $i = 0; $i < length $seq; $i += 3) { my $triplet = substr $seq, $i, 3; if (exists $amino_acid_conversion{$triplet}) { print "$amino_acid_conversion{$triplet}"; } else { warn "Don't know how to convert '$triplet'.\n"; } $k++; } print "\n"; my $xgk = $x / $k; my $ygk = $y / $k; my $zgk = $z / $k; print "$xgk $ygk $zgk \n";

    Note that I also turned strict and warnings on, used the 3-argument version of open without bareword filehandles, and checked its return value.

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: Issues regarding for loops and recursion
by Corion (Patriarch) on Jan 09, 2023 at 12:44 UTC

    Hi Nickmofoe, and welcome to Perl!

    if ($_=~m/^ATOM\s+\d+\s+\w+\s+\w+\s+\w+\d+\s+(\S+)\s+(\S+)\s+(\S+)/){

    Here you match against $_, but it was never set. Did you want to match against $seq? I'm not sure.

    Some general stylistic things:

    open (IN, '6U9D.pdb.txt');

    You should check whether opening the file actually succeeded:

    my $filename = '6U9D.pdb.txt'; open (IN, $filename) or die "Couldn't open '$filename': $!";

    You should use the strict pragma. This requires you to declare all variables but on the upside, it allows Perl to tell you about typos in your variable names.

    When assigning the %amino_acid_conversion, you could use newlines to make the code a bit more readable:

    %amino_acid_conversion = ( ALA=>'A', TYR=>'Y', MET=>'M', LEU=>'L', CYS=>'C', GLY=>'G', ARG=>'R', ASN=>'N', ASP=>'D', GLN=>'Q', GLU=>'E', HIS=>'H', TRP=>'W', LYS=>'K', PHE=>'F', PRO=>'P', SER=>'S', THR=>'T', ILE=>'I', VAL=>'V' );
      When assigning the %amino_acid_conversion, you could use newlines to make the code a bit more readable

      Completely agree. I'd go even further and put whitespace around the fat commas and with so many entries I would sort by either key or value to make it easier to spot misses/dupes/typos.

      %amino_acid_conversion = ( ALA => 'A', ARG => 'R', ASN => 'N', ASP => 'D', CYS => 'C', GLN => 'Q', GLU => 'E', GLY => 'G', HIS => 'H', ILE => 'I', LEU => 'L', LYS => 'K', MET => 'M', PHE => 'F', PRO => 'P', SER => 'S', THR => 'T', TRP => 'W', TYR => 'Y', VAL => 'V' );


      Another initialization readability trick is to use YAML in a HEREDOC which gets you (IMHO) less "noise" to initializing data:

      use YAML::XS qw( Load ); my %conversions = %{Load(<<'EOT')}; --- ALA: A TYR: Y MET: M ... EOT

      The cake is a lie.
      The cake is a lie.
      The cake is a lie.

Re: Issues regarding for loops and recursion
by Marshall (Canon) on Jan 12, 2023 at 12:38 UTC
    I see some trouble with your code to parse out the sequences from SEQRES lines. There is some redundant information in the .pdb format. I used the sequential letter (A,B,C,D,E,etc) to delineate start/end of the various sequences. Number of 3 letter groups in the sequence would be another way. I am not sure what you are trying to do. At the end of code, see a couple of example FASTA segments I cut n' pasted from the program output.

    I have no ide what to do with the x,y,z atom coordinates or even what they mean. I averaged them for fun. I changed your regex a bit to make it easier to pick out the +-floating point numbers.

    To run code, download the .pdb file and name it "6U9D.pdb". In production code, I probably would not use so much memory for temporary results, preferring to calculate and output results as they occur rather than save a bunch of data and process it at the end. However for these prototype things, having the data in intermediate forms can useful for debugging. Have fun.

    use strict; use warnings; open my $pdb_fh, '<', "6U9D.pdb" or die "unable to open 6U9D.pdb for r +eading $!"; my %amino_acid_conversion = ( ALA => 'A', ARG => 'R', ASN => 'N', ASP => 'D', CYS => 'C', GLN => 'Q', GLU => 'E', GLY => 'G', HIS => 'H', ILE => 'I', LEU => 'L', LYS => 'K', MET => 'M', PHE => 'F', PRO => 'P', SER => 'S', THR => 'T', TRP => 'W', TYR => 'Y', VAL => 'V'); my @atoms; #2D array of x,y,z of all atoms (many thousands) my @seqs; #raw sequences with the 3 letter designations #convert later to FASTA format my $cur_seq = ''; my $cur_ltr = ''; # parse out data of interest from file # while (<$pdb_fh>) { if (my ($ltr, $seq) = (/^SEQRES\s+\d+\s+(\w+)\s+\d+\s+([A-Z ]+)$/) +) { $seq =~ s/\s+$//; if ($ltr eq $cur_ltr) { $cur_seq .= " $seq"; } else { push @seqs, $cur_seq if $cur_seq ne ''; # end of current seq +uence $cur_seq = $seq; # begin of the next +sequence $cur_ltr = $ltr; } } elsif (my ($x,$y,$z) = (/^ATOM\s+.*?([\d.-]+)\s+([\d.-]+)\s+([\d.-] ++)/) ) { push @atoms, [$x,$y,$z]; } } push @seqs, $cur_seq; # don't forget to finish the last seq! #### output collected data ### #make a fasta sequence segments foreach my $seq (@seqs) { # my $fasta = join '',map{$amino_acid_conversion{$_}}split ' ',$seq +; # without using a map: # my $fasta =''; foreach my $char3 (split ' ',$seq) { $fasta.= $amino_acid_conversion{$char3} } print ">Some Fasta Description Line\n"; #use 60 char lines while ($fasta) #fasta suggested max is 80 { print substr($fasta,0,60,''),"\n"; } } #print the data points # I am not sure what needs to be done with them # average of each coordinate? #foreach my $row_ref (@atoms) #uncomment to print #{ # print @$row_ref,"\n"; #} my $xsum; my $ysum; my $zsum; foreach my $row_ref (@atoms) # @atoms is a 2D array { my ($x, $y , $z ) = @$row_ref; $xsum+=$x; $ysum+=$y; $zsum+=$z; } print "avg x = ",$xsum/@atoms,"\n"; print "avg y = ",$ysum/@atoms,"\n"; print "avg z = ",$zsum/@atoms,"\n"; __END__ These are 2 examples: You will have to figure out what goes in the FASTA description line And perhaps not all of these sequences are relevant? Looks like a lot are duplicates. >Some Fasta Description Line MHHHHHHENLYFQGAPSFNVDPLEQPAEPSKLAKKLRAEPDMDTSFVGLTGGQIFNEMMS RQNVDTVFGYPGGAILPVYDAIHNSDKFNFVLPKHEQGAGHMAEGYARASGKPGVVLVTS GPGATNVVTPMADAFADGIPMVVFTGQVPTSAIGTDAFQEADVVGISRSCTKWNVMVKSV EELPLRINEAFEIATSGRPGPVLVDLPKDVTAAILRNPIPTKTTLPSNALNQLTSRAQDE FVMQSINKAADLINLAKKPVLYVGAGILNHADGPRLLKELSDRAQIPVTTTLQGLGSFDQ EDPKSLDMLGMHGCATANLAVQNADLIIAVGARFDDRVTGNISKFAPEARRAAAEGRGGI IHFEVSPKNINKVVQTQIAVEGDATTNLGKMMSKIFPVKERSEWFAQINKWKKEYPYAYM EETPGSKIKPQTVIKKLSKVANDTGRHVIVTTGVGQHQMWAAQHWTWRNPHTFITSGGLG TMGYGLPAAIGAQVAKPESLVIDIDGDASFNMTLTELSSAVQAGTPVKILILNNEEQGMV TQWQSLFYEHRYSHTHQLNPDFIKLAEAMGLKGLRVKKQEELDAKLKEFVSTKGPVLLEV EVDKKVPVLPMVAGGSGLDEFINFDPEVERQQTELRHKRTGGKH >Some Fasta Description Line MGSSHHHHHHSSGLVPRGSHMENLYFQGATRPPLPTLDTPSWNANSAVSSIIYETPAPSR QPRKQHVLNCLVQNEPGVLSRVSGTLAARGFNIDSLVVCNTEVKDLSRMTIVLQGQDGVI EQARRQIEDLVPVYAVLDYTNSEIIKRELVMARISLLGTEYFEDLLLHHHTSTNAGAADS QELVAEIREKQFHPANLPASEVLRLKHEHLNDITNLTNNFGGRVVDISETSCIVELSAKP TRISAFLKLVEPFGVLECARSGMMALPRTPLKTSTEEAADEDEKISEIVDISQLPPG I have no idea what these numbers would mean? avg x = 321.013155298296 avg y = 290.744642162734 avg z = 69.196842162731