http://qs321.pair.com?node_id=11133124

cosmin has asked for the wisdom of the Perl Monks concerning the following question:

Hello everyone! I have to extract the CDS zones from one chromosome file to an secondary text file. How I manage to extract all of them ? I've read that these zones are considered as tags, and their join complement as secondary tags... Also, i have to list the ORIGIN zone, but i have to cross over the nucleotide string that fix with the one from the CDS and then to write it in the secondary text file bellow the CDS. Can somebody help me handle with this task? I'm newbie in perl..

This is my task: You must to design a program capable of extracting only the CDS sections of such a file, which are described in the FEATURES section, to which should be added their corresponding nucleotide sequences described in the ORIGIN section, thus creating a new .txt file with a much simpler structure. The designed program must extract from the original file all portions of the CDS with their description to which it must add, by selective extraction from the ORIGIN section, the corresponding nucleotide sequence, thus creating a new .txt file with, in order, only the descriptions of CDS in which the corresponding nucleotide sequences appear.

I have wrote this code, but I don't know if it is really good, I think it needs more improvement.. but I'm still stucked here

#!/usr/bin/perl -w use warnings; use warnings FATAL => q{void}; use warnings FATAL => 'syntax'; use strict; use Data::Dumper; print "Enter your chromosome"; my $chromosome = <STDIN>; chomp $chromosome; print "Your file is '$chromosome'\n"; # reading whole file $file_name = $ARGV[0]; open(my $file, '<', $file_name) or die "Sorry, we can't open your $fil +e_name $!"; @content= <$file>; close $file; print "\n\n"; $index_max = $#content; for ($start=0; $start <= $index_max; $start++) { chomp $content[$i]; print "Reg $i: $content[$i]\n"; if ($content[$start] =~ /CDS/ ) { print "There is CDS \n"; for ($start;$start <= $index_max; $start++) { @cds = (@cds, $content[$start]); print @cds; } open (WRITE, ">>concatenare.txt"); print WRITE @cds; close WRITE; } #This chromosome have tags. Identify CDS ones. my @features =$mySeq ->all_seqfeatures(); foreach $feature (@features) { my @tag = $CDS; my @feat = $join; foreach $tag ( $feat->all tags() ); print "Feature region has tag", $tag, "CDS", join(‘ ‘,$feat->each tag value($tag)), "\n"; } if ($content[$start] =~ /ORIGIN/ ) { print "There is ORIGIN !! \n"; for ($start;$start <= $index_max; $start++) { @origin = (@origin, $content[$start]); shift (@origin); #Delete ORIGIN row; print @origin; for @origin { while ($_ =~ m/[ACGTURYKMSWBDHVN]/ig) { $seq = $seq.$&; } } $lenght = length($seq); print "The sequence has $nucleotide length\n"; print "$seq"; } } } print "There is ",$index_max," inregistrari\n"; print "This are:\n\n"; my $file = "concatenare.txt"; my $succ = open( my $fh , '>>', $file ); $fh = *STDOUT unless $succ; print $fh "CDS1 \n"; print $fh "Nucleotide sequence \n"; print $fh "CDS2 \n"; print $fh "Nucleotide sequence \n";
  • Comment on Stuck in Perl.. Partial code, but it needs more improvement.. Any suggestions, please?
  • Download Code

Replies are listed 'Best First'.
Re: Stuck in Perl.. Partial code, but it needs more improvement.. Any suggestions, please?
by Fletch (Bishop) on May 27, 2021 at 13:01 UTC

    Prossibly not useable (this has the stench of homework about it) but considering the problem domain BioPerl would be the first thing to check out. Even if it is homework you're supposed to build from scratch you may be able to gain inspiration, but judging by the orphaned method call on a $mySeq instance its use could be expected.

    And mentioning because it jumped out at me: ALWAYS CHECK THE RETURN FROM open CALLS. You did your first but you didn't creating some of your output file(s) which means that's inevitably going to silently fail some time and you're then going to be stuck trying to track down what's gone wrong when that happens (and it will; Murphy sez so). Even your trick conditionally opening the last output it might be useful to print an informational diagnostic message to indicate why you're not writing to the file rather than STDOUT.

    if( not $succ ) { print STDERR qq{Unable to append to '$file', using STDOUT: $!\n}; $fh = \*STDOUT; }

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

      > ALWAYS CHECK THE RETURN FROM open ...

      The typical idiom for this is,

      open (my $fh, $mode, $file) || die $!;

      This also mitigates the need for an -e check for read $mode that expect $file exist, particularly useful inside of an eval block or Try::Tiny construct; so that you can handle it.

      Update, fixed precedence. Point remains the same.

        Check your precedence. The code you showed does not do what you think.

        $ perl -Mstrict -wE 'say -e "fooble" ? 1 : 0; open my $fh, "<", "foobl +e" || die $!' 0
        $ perl -Mstrict -wE 'say -e "fooble" ? 1 : 0; open my $fh, "<", "foobl +e" or die $!' 0 No such file or directory at -e line 1.


        The way forward always starts with a minimal test.
Re: Stuck in Perl.. Partial code, but it needs more improvement.. Any suggestions, please?
by cavac (Parson) on May 27, 2021 at 12:36 UTC

    Very few people here are doctors in medicine. I certainly am not. I know how to munch data files, but my knowledge of chromosomes extend as far as "they exist" and "i have some of those".

    Could you provide a short example file(s) and a (hand crafted) file of the expected output? Also, a non-scientist explanation for us software developers in the form of "column A of this file example1 needs to get concatenated to column B of file example2 if column C in example2 matches part of column X in example2". E.g. a bit more abstracted/technical without requiring 10 years of university work.

    perl -e 'use Crypt::Digest::SHA256 qw[sha256_hex]; print substr(sha256_hex("the Answer To Life, The Universe And Everything"), 6, 2), "\n";'
Re: Stuck in Perl.. Partial code, but it needs more improvement.. Any suggestions, please?
by glycine (Sexton) on May 27, 2021 at 15:12 UTC

    hello! becuase you mentioned about "CDS", "original", so I guess you want parser information from a genbank format file, right? (e.g. https://www.ncbi.nlm.nih.gov/nuccore/M85050.1) as cavac say, if you do not offer file format, other people can't understand what do you want to do, and then understand wether the code work good. so I can't offer how to make your code work correctly, too.

    there are some part of your maybe need change, I just try to give some advice:)

    @cds = (@cds, $content[$start]); print @cds;

    before this paragraph, @cd did not defined in anywher, so this will make script stop work, and because it assigned by itself, I can't understand purpose of this statements, so I don't know how to modify it, but I sure it won't work now.

    for ($start;$start <= $index_max; $start++) { @origin = (@origin, $content[$start]); shift (@origin); #Delete ORIGIN row; print @origin; for @origin {

    this part have same problem, what is @origin?

    #This chromosome have tags. Identify CDS ones. my @features =$mySeq ->all_seqfeatures(); foreach $feature (@features) { my @tag = $CDS; my @feat = $join; foreach $tag ( $feat->all tags() ); print "Feature region has tag", $tag, "CDS", join(‘ ‘,$feat->each tag value($tag)), "\n"; }

    I guess you defined subroutines in other site, so I don't what can they do, ;-;.

    open(my $file, '<', $file_name) or die "Sorry, we can't open your $fil +; open (WRITE, ">>concatenare.txt");

    you used two style of open() in this code, first is better, please use it every times.

    you can use  while(my $line = <$fh>){...}, this will read one line of file in each times. so you don't need read all content into a array, and don't need to find the last index before. if you actually treat a very big file, it will reduce ram occupied by program.

    p.s. yes, Bioperl can resolve most of problem.

Re: Stuck in Perl.. Partial code, but it needs more improvement.. Any suggestions, please?
by GrandFather (Saint) on Jun 01, 2021 at 00:14 UTC

    We like to see strictures (use strict; use warnings;), but it looks like you added them as boiler plate to make us happy. As Marshall hints you can not possibly have run the code as presented. In fact even with strictures removed there are still many syntax errors that prevent the code from running. You have presented us with a steaming pile of dog turds so you haven't received very much help. As a first step you need to present code that actually runs and for very strong preference is self contained (see I know what I mean. Why don't you? for how you might do that).

    At this stage tools such as BioPerl are likely to be a distraction until you get some basic Perl squared away. We can help with your Perl understanding given a sound starting point (so that we can understand where you are having trouble). Fewer of us can help with the biological component of your problem and none of us should do your homework for you.

    Bottom line: you need to get a stand alone (no external files) test version of your code going at least enough that you can come back to us with your Perl coding problems.

    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
      "steaming pile of dog turds" is harsh but accurate.

      I agree with this:
      Bottom line: you need to get a stand alone (no external files) test version of your code going at least enough that you can come back to us with your Perl coding problems.

      ADDED: I am against describing code in terms of animal excrement in public even if that is what it is.

Re: Stuck in Perl.. Partial code, but it needs more improvement.. Any suggestions, please?
by Marshall (Canon) on May 31, 2021 at 23:43 UTC
    Your problem description didn't mention how your code fails. So, my first step was to add a line "exit;" right at the start of your code and then try to run the code. I put that exit statement in just to see compile errors. Your code does not come anywhere close to compiling. Perl gives up after the first 34 compile errors!

    I would strongly recommend investigating BioPerl. Due to the regex matching capabilities of Perl, Perl excels at DNA bio stuff.

    You should be aware that a FORTRAN or C style loop, eg. for ($start=0; $start <= $index_max; $start++) is rare in Perl. While this is legal syntax, the very most common error in programming is the "off by one error". Perl can iterate over an array without using index values.

    Again, I recommend that you investigate bioperl. Without any experience or proof, I HIGHLY suspect that parsing a record like shown below is a basic function of that universe of programs and modules.

    I recoded part of what I think you are trying to do below.
    I did not code the option to eliminate the "sig_peptide" line from @CDS.
    See Flipin good, or a total flop? for more details of how to do that.

    Without more data like shown below, I am unable to help further.

    use strict; use warnings; $|=1; my @CDS; my @ORIGIN; while (my $line =<DATA>) { next unless ($line =~ /\S/); #skip blank data lines push (@CDS,$line) if ( $line =~ /^\s+CDS/ ... $line =~/^\s*[a-z]+/ +); push (@ORIGIN, $line) if ($line =~ /^\s*ORIGIN/...$line !~/\s*\d/); + } print "CDS section:\n"; print @CDS; print "\nORIGIN section:\n"; print @ORIGIN; =PRINTS: ########## a copy from my command line output #### CDS section: CDS 10..1011 /gene="hemoglobin" /codon_start=1 /product="hemoglobin" /protein_id="AAA29796.1" /translation="MHSSIVLATVLFVAIASASKTRELCMKSLEHAKVG +TSKEAKQDG IDLYKHMFEHYPAMKKYFKHRENYTPADVQKDPFFIKQGQNILLACHVL +CATYDDRET FDAYVGELMARHERDHVKVPNDVWNHFWEHFIEFLGSKTTLDEPTKHAW +QEIGKEFSH EISHHGRHSVRDHCMNSLEYIAIGDKEHQKQNGIDLYKHMFEHYPHMRK +AFKGRENFT KEDVQKDAFFVNKDTRFCWPFVCCDSSYDDEPTFDYFVDALMDRHIKDD +IHLPQEQWH EFWKLFAEYLNEKSHQHLTEAEKHAWSTIGEDFAHEADKHAKAEKDHHE +GEHKEEHH" sig_peptide 10..63 ORIGIN section: ORIGIN 1 ggaaccatta tgcactcttc aatagttttg gccaccgtgc tctttgtagc gattg +cttca 61 gcatcaaaaa cgcgagagct atgcatgaaa tcgctcgagc atgccaaggt tggca +ccagc =cut ## Data is abbreviated from https://www.ncbi.nlm.nih.gov/nuccore/M8505 +0.1 ## example URL courtesty of [glycine] __DATA__ COMMENT Original source text: Pseudoterranova decipiens larval cDN +A to mRNA. FEATURES Location/Qualifiers source 1..1353 /organism="Pseudoterranova decipiens" /mol_type="mRNA" /db_xref="taxon:6271" /dev_stage="larval" gene 1..1353 /gene="hemoglobin" 5'UTR 1..9 /gene="hemoglobin" CDS 10..1011 /gene="hemoglobin" /codon_start=1 /product="hemoglobin" /protein_id="AAA29796.1" /translation="MHSSIVLATVLFVAIASASKTRELCMKSLEHAKVG +TSKEAKQDG IDLYKHMFEHYPAMKKYFKHRENYTPADVQKDPFFIKQGQNILLACHVL +CATYDDRET FDAYVGELMARHERDHVKVPNDVWNHFWEHFIEFLGSKTTLDEPTKHAW +QEIGKEFSH EISHHGRHSVRDHCMNSLEYIAIGDKEHQKQNGIDLYKHMFEHYPHMRK +AFKGRENFT KEDVQKDAFFVNKDTRFCWPFVCCDSSYDDEPTFDYFVDALMDRHIKDD +IHLPQEQWH EFWKLFAEYLNEKSHQHLTEAEKHAWSTIGEDFAHEADKHAKAEKDHHE +GEHKEEHH" sig_peptide 10..63 /gene="hemoglobin" mat_peptide 64..1008 /gene="hemoglobin" /product="hemoglobin" misc_feature 696..737 /gene="hemoglobin" /phenotype="'altered epitope (frameshift)'" 3'UTR 1013..1353 /gene="hemoglobin" ORIGIN 1 ggaaccatta tgcactcttc aatagttttg gccaccgtgc tctttgtagc gattg +cttca 61 gcatcaaaaa cgcgagagct atgcatgaaa tcgctcgagc atgccaaggt tggca +ccagc