Beefy Boxes and Bandwidth Generously Provided by pair Networks
Think about Loose Coupling
 
PerlMonks  

Problem computing GC content

by zuepinhel (Initiate)
on Jul 13, 2014 at 17:26 UTC ( [id://1093451]=perlquestion: print w/replies, xml ) Need Help??

zuepinhel has asked for the wisdom of the Perl Monks concerning the following question:

Hey guys! I'm new to Perl, and on my first assignment i've been developing a program to calculate the GC content of DNA sequences on a txt file. The file has +10 sequences and i'm only getting the result of the last one. Is there some problem with the loops ? The code is as follows. Thanks!

#!/usr/local/bin/perl open (FASTAFILE, 'human_fold_25.txt'); $sequence = ' '; $Ccount = 0; $Gcount = 0; $identifier = ' '; while ($line = <FASTAFILE>) { chomp $line; if ( $line =~ /^>/ ) { $identifier = $line; } elsif ( $line =~ /^\./ ) { next; } elsif ( $line =~ /^\(/ ) { }else { $sequence = $line; } } $sequencelength = length ($sequence); @nucleotides = split ( '' , $sequence ); foreach $nuc (@nucleotides) { if ( $nuc eq 'G') { $Gcount = $Gcount + 1; } elsif ( $nuc eq 'C') { $Ccount = $Ccount + 1; } } $GCcontent = ((( $Gcount + $Ccount )/ $sequencelength ) * 100); close (FASTAFILE); print "$identifier", ", ", "$GCcontent\n"

Replies are listed 'Best First'.
Re: Problem computing GC content
by AppleFritter (Vicar) on Jul 13, 2014 at 17:37 UTC

    Howdy zuepinhel, welcome to the Monastery!

    Take a look at your initial while loop. You're assigning to $sequence (and $identifier) in each iteration, so when the loop finishes, $sequence will only contain the last line read, and all the rest of your script will only operate on that.

    Try pulling the code following the loop into the loop -- except for closing the file, obviously.

      Can you be more precise ? I'm new to this. What do you propose i do ?

        Certainly. Something like the following:

        #!/usr/local/bin/perl use feature qw/say/; use strict; use warnings; open my $FASTAFILE, "<", 'human_fold_25.txt' or die "Could not open file: $!"; my $sequence = ' '; my $Ccount = 0; my $Gcount = 0; my $identifier = ' '; while(my $line = <$FASTAFILE>) { chomp $line; next if $line =~ m/^\./; if ($line =~ /^>/) { $identifier = $line; } elsif ($line !~ /^\(/) { $sequence = $line; $Gcount = $Ccount = 0; my $sequencelength = length $sequence; my @nucleotides = split('', $sequence); foreach my $nuc (@nucleotides) { if ($nuc eq 'G') { $Gcount = $Gcount + 1; } elsif ($nuc eq 'C') { $Ccount = $Ccount + 1; } } my $GCcontent = ((($Gcount + $Ccount) / $sequencelength) * 100 +); say "$identifier, $GCcontent"; } } close $FASTAFILE or die "Could not close file: $!";

        Note I've made a few other changes to your code, namely:

        • use strict;, and use warnings;. These are your friends; they'll catch many errors for you, and you should always use them. As a side effect, I've also declared variables with my where necessary.
        • use feature qw/say/; it's nicer than print if you want to add a newline anyway, IMO.
        • Lexical filehandles ($FASTAFILE vs. FASTAFILE).
        • Three-argument form of open.
        • Error checking for open and close.
        • Slightly more idiomatic handling of lines starting with periods.
        • Whitespace changes, to improve readability.

        Your code could still be tightened up further, of course. For instance, the entire block following elsif ($line !~ /^\(/) could be reduced to the following:

        ... } elsif ($line !~ /^\(/) { my $GCcount =()= ($line =~ m/(G|C)/g); say "$identifier, " , ($GCcount / length($line) * 100); } ...

        but since you're still new to Perl, I imagine that wouldn't make as much sense. :)

        Look through the code you posted and identify where the while loop ends. Only code inside the loop gets executed repeatedly, so move that loop-ending point after all the lines of code that must run once per loop.

        Here are two good exercises for getting started with any programming languages. First, before you even try running a script, make your eyes run it. Read through the lines. See where decisions are, and imagine you are the computer. Choose the if or the else, and run it in your mind. When you see a while or a foreach loop, read through the lines, then back up to the top, and back around a few times. Then imagine you've reached the end of the loop and go on to the rest of the code. This helps give you a feel for what the machine does when it reads your code.

        Second, experiment with very simple throw-away scripts. Try different kinds of loops, put a few print statements in different places so you can see when, and how often, they print out when you run the script. If you type something that's totally non-working, erase it and do something else. No harm done (assuming you don't tell Perl to delete all your files or something like that).

        When you get an error, use Perl Monks' Super Search or Google to see what others have done. 95% of the time you'll find somebody has had your problem before and found a solution.

        (edit: I'm too slow, you've got a detailed reply. I still recommend trying these two exercises often. They'll serve you well.)
Re: Problem computing GC content
by AnomalousMonk (Archbishop) on Jul 13, 2014 at 20:57 UTC

    AppleFritter seems to have given you a good solution to the specific question you asked. However, I would suggest you become familiar with the string manipulation facilities of Perl.

    Specifically for this problem, the tr operator would seem very useful. Please see its discussion via the tr link and in the Quote-Like Operators section of perlop.

    c:\@Work\Perl>perl -wMstrict -le "my $sequence = 'xxxGCxxCxxxxCGxGxxxGxxxxxCxx'; ;; my $sequencelength = length $sequence; print qq{sequence length: $sequencelength}; ;; my $GCcount = $sequence =~ tr/GC//; print qq{total Gs and Cs: $GCcount}; ;; my $GCcontent = ($GCcount / $sequencelength) * 100; print qq{GCcontent: $GCcontent percent} " sequence length: 28 total Gs and Cs: 8 GCcontent: 28.5714285714286 percent

      But it doesn't solve the problem because it continues on computing just one sequence.

        It wasn't intended to solve "the problem". AppleFritter has already offered what seems an adequate solution to the problem — although I must admit that my understanding of the problem may be very defective.

        What I wanted to offer was an approach to solving a portion of the problem: quickly counting the number of 'G' and 'C' characters in a string. One very Perlish way to do something like this is by using the tr operator.


        Data! SHOW US THE DATA!


        Questions containing the words "doesn't work" (or their moral equivalent) will usually get a downvote from me unless accompanied by:
        1. code
        2. verbatim error and/or warning messages
        3. a coherent explanation of what "doesn't work actually means.

        check Ln42!

Re: Problem computing GC content
by Kenosis (Priest) on Jul 14, 2014 at 17:03 UTC

    Here's another option:

    use strict; use warnings; local $/ = '>'; while (<>) { chomp; /\S/ or next; my ( $id, $seq ) = /(.+?)\n(.+)/s; $seq =~ s/\n//g; my $GCcount = $seq =~ tr/GC//; printf "%s, %.2f%%\n", ">$id", ( $GCcount / length $seq ) * 100; }

    Command-line usage: perl script.pl fastaFile [>outfile]

    The last, optional parameter directs output to a file.

    Sample FASTA record:

    >gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus] LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX IENY

    Output on that sample FASTA record:

    >gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus], 7.3 +9%

    Since FASTA files use ">" as the record separator, the script sets Perl's record separator to that character, so the file's read one FASTA record at a time. After a read, chomp removes that record separator.

    Next, /\s/ tests to see that there are characters to parse and, if not, the next record is read. Using a regex, the id and seq are captured. A substitution is used to remove any newlines, so an accurate length count can be made. (The sequence in FASTA records may be across multiple lines; this script will handle these.)

    As AnomalousMonk demonstrated, Perl's transliteration operator can be used to get the GC character count within the sequence. Finally, the results are printed using printf.

    Hope this helps!

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1093451]
Approved by AppleFritter
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others romping around the Monastery: (8)
As of 2024-04-24 12:05 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found