Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??
I see some trouble with your code to parse out the sequences from SEQRES lines. There is some redundant information in the .pdb format. I used the sequential letter (A,B,C,D,E,etc) to delineate start/end of the various sequences. Number of 3 letter groups in the sequence would be another way. I am not sure what you are trying to do. At the end of code, see a couple of example FASTA segments I cut n' pasted from the program output.

I have no ide what to do with the x,y,z atom coordinates or even what they mean. I averaged them for fun. I changed your regex a bit to make it easier to pick out the +-floating point numbers.

To run code, download the .pdb file and name it "6U9D.pdb". In production code, I probably would not use so much memory for temporary results, preferring to calculate and output results as they occur rather than save a bunch of data and process it at the end. However for these prototype things, having the data in intermediate forms can useful for debugging. Have fun.

use strict; use warnings; open my $pdb_fh, '<', "6U9D.pdb" or die "unable to open 6U9D.pdb for r +eading $!"; my %amino_acid_conversion = ( ALA => 'A', ARG => 'R', ASN => 'N', ASP => 'D', CYS => 'C', GLN => 'Q', GLU => 'E', GLY => 'G', HIS => 'H', ILE => 'I', LEU => 'L', LYS => 'K', MET => 'M', PHE => 'F', PRO => 'P', SER => 'S', THR => 'T', TRP => 'W', TYR => 'Y', VAL => 'V'); my @atoms; #2D array of x,y,z of all atoms (many thousands) my @seqs; #raw sequences with the 3 letter designations #convert later to FASTA format my $cur_seq = ''; my $cur_ltr = ''; # parse out data of interest from file # while (<$pdb_fh>) { if (my ($ltr, $seq) = (/^SEQRES\s+\d+\s+(\w+)\s+\d+\s+([A-Z ]+)$/) +) { $seq =~ s/\s+$//; if ($ltr eq $cur_ltr) { $cur_seq .= " $seq"; } else { push @seqs, $cur_seq if $cur_seq ne ''; # end of current seq +uence $cur_seq = $seq; # begin of the next +sequence $cur_ltr = $ltr; } } elsif (my ($x,$y,$z) = (/^ATOM\s+.*?([\d.-]+)\s+([\d.-]+)\s+([\d.-] ++)/) ) { push @atoms, [$x,$y,$z]; } } push @seqs, $cur_seq; # don't forget to finish the last seq! #### output collected data ### #make a fasta sequence segments foreach my $seq (@seqs) { # my $fasta = join '',map{$amino_acid_conversion{$_}}split ' ',$seq +; # without using a map: # my $fasta =''; foreach my $char3 (split ' ',$seq) { $fasta.= $amino_acid_conversion{$char3} } print ">Some Fasta Description Line\n"; #use 60 char lines while ($fasta) #fasta suggested max is 80 { print substr($fasta,0,60,''),"\n"; } } #print the data points # I am not sure what needs to be done with them # average of each coordinate? #foreach my $row_ref (@atoms) #uncomment to print #{ # print @$row_ref,"\n"; #} my $xsum; my $ysum; my $zsum; foreach my $row_ref (@atoms) # @atoms is a 2D array { my ($x, $y , $z ) = @$row_ref; $xsum+=$x; $ysum+=$y; $zsum+=$z; } print "avg x = ",$xsum/@atoms,"\n"; print "avg y = ",$ysum/@atoms,"\n"; print "avg z = ",$zsum/@atoms,"\n"; __END__ These are 2 examples: You will have to figure out what goes in the FASTA description line And perhaps not all of these sequences are relevant? Looks like a lot are duplicates. >Some Fasta Description Line MHHHHHHENLYFQGAPSFNVDPLEQPAEPSKLAKKLRAEPDMDTSFVGLTGGQIFNEMMS RQNVDTVFGYPGGAILPVYDAIHNSDKFNFVLPKHEQGAGHMAEGYARASGKPGVVLVTS GPGATNVVTPMADAFADGIPMVVFTGQVPTSAIGTDAFQEADVVGISRSCTKWNVMVKSV EELPLRINEAFEIATSGRPGPVLVDLPKDVTAAILRNPIPTKTTLPSNALNQLTSRAQDE FVMQSINKAADLINLAKKPVLYVGAGILNHADGPRLLKELSDRAQIPVTTTLQGLGSFDQ EDPKSLDMLGMHGCATANLAVQNADLIIAVGARFDDRVTGNISKFAPEARRAAAEGRGGI IHFEVSPKNINKVVQTQIAVEGDATTNLGKMMSKIFPVKERSEWFAQINKWKKEYPYAYM EETPGSKIKPQTVIKKLSKVANDTGRHVIVTTGVGQHQMWAAQHWTWRNPHTFITSGGLG TMGYGLPAAIGAQVAKPESLVIDIDGDASFNMTLTELSSAVQAGTPVKILILNNEEQGMV TQWQSLFYEHRYSHTHQLNPDFIKLAEAMGLKGLRVKKQEELDAKLKEFVSTKGPVLLEV EVDKKVPVLPMVAGGSGLDEFINFDPEVERQQTELRHKRTGGKH >Some Fasta Description Line MGSSHHHHHHSSGLVPRGSHMENLYFQGATRPPLPTLDTPSWNANSAVSSIIYETPAPSR QPRKQHVLNCLVQNEPGVLSRVSGTLAARGFNIDSLVVCNTEVKDLSRMTIVLQGQDGVI EQARRQIEDLVPVYAVLDYTNSEIIKRELVMARISLLGTEYFEDLLLHHHTSTNAGAADS QELVAEIREKQFHPANLPASEVLRLKHEHLNDITNLTNNFGGRVVDISETSCIVELSAKP TRISAFLKLVEPFGVLECARSGMMALPRTPLKTSTEEAADEDEKISEIVDISQLPPG I have no idea what these numbers would mean? avg x = 321.013155298296 avg y = 290.744642162734 avg z = 69.196842162731

In reply to Re: Issues regarding for loops and recursion by Marshall
in thread Issues regarding for loops and recursion by Nickmofoe

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?

What's my password?
Create A New User
Domain Nodelet?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others rifling through the Monastery: (3)
As of 2023-12-09 07:46 GMT
Find Nodes?
    Voting Booth?
    What's your preferred 'use VERSION' for new CPAN modules in 2023?

    Results (37 votes). Check out past polls.