Problem computing GC content

zuepinhel has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Problem computing GC content by AppleFritter (Vicar) on Jul 13, 2014 at 17:37 UTC
Howdy zuepinhel, welcome to the Monastery! Take a look at your initial `while` loop. You're assigning to `$sequence` (and `$identifier`) in each iteration, so when the loop finishes, `$sequence` will only contain the last line read, and all the rest of your script will only operate on that. Try pulling the code following the loop into the loop -- except for closing the file, obviously.	[reply]
Re^2: Problem computing GC content by zuepinhel (Initiate) on Jul 13, 2014 at 17:46 UTC
Can you be more precise ? I'm new to this. What do you propose i do ?	[reply]
Re^3: Problem computing GC content by AppleFritter (Vicar) on Jul 13, 2014 at 18:02 UTC
Certainly. Something like the following: #!/usr/local/bin/perl use feature qw/say/; use strict; use warnings; open my $FASTAFILE, "<", 'human_fold_25.txt' or die "Could not open file: $!"; my $sequence = ' '; my $Ccount = 0; my $Gcount = 0; my $identifier = ' '; while(my $line = <$FASTAFILE>) { chomp $line; next if $line =~ m/^\./; if ($line =~ /^>/) { $identifier = $line; } elsif ($line !~ /^\(/) { $sequence = $line; $Gcount = $Ccount = 0; my $sequencelength = length $sequence; my @nucleotides = split('', $sequence); foreach my $nuc (@nucleotides) { if ($nuc eq 'G') { $Gcount = $Gcount + 1; } elsif ($nuc eq 'C') { $Ccount = $Ccount + 1; } } my $GCcontent = ((($Gcount + $Ccount) / $sequencelength) * 100 +); say "$identifier, $GCcontent"; } } close $FASTAFILE or die "Could not close file: $!"; [download] Note I've made a few other changes to your code, namely: `use strict;`, and `use warnings;`. These are your friends; they'll catch many errors for you, and you should always use them. As a side effect, I've also declared variables with `my` where necessary. `use feature qw/say/`; it's nicer than `print` if you want to add a newline anyway, IMO. Lexical filehandles (`$FASTAFILE` vs. `FASTAFILE`). Three-argument form of `open`. Error checking for `open` and `close`. Slightly more idiomatic handling of lines starting with periods. Whitespace changes, to improve readability. Your code could still be tightened up further, of course. For instance, the entire block following `elsif ($line !~ /^\(/)` could be reduced to the following: `... } elsif ($line !~ /^\(/) { my $GCcount =()= ($line =~ m/(G\|C)/g); say "$identifier, " , ($GCcount / length($line) * 100); } ...` [download] but since you're still new to Perl, I imagine that wouldn't make as much sense. :)	[reply] [d/l] [select]
Re^4: Problem computing GC content by zuepinhel (Initiate) on Jul 13, 2014 at 18:13 UTC
Re^5: Problem computing GC content by AppleFritter (Vicar) on Jul 13, 2014 at 18:16 UTC
Some notes below your chosen depth have not been shown here
Re^3: Problem computing GC content by Bethany (Scribe) on Jul 13, 2014 at 18:10 UTC
Look through the code you posted and identify where the while loop ends. Only code inside the loop gets executed repeatedly, so move that loop-ending point after all the lines of code that must run once per loop. Here are two good exercises for getting started with any programming languages. First, before you even try running a script, make your eyes run it. Read through the lines. See where decisions are, and imagine you are the computer. Choose the if or the else, and run it in your mind. When you see a while or a foreach loop, read through the lines, then back up to the top, and back around a few times. Then imagine you've reached the end of the loop and go on to the rest of the code. This helps give you a feel for what the machine does when it reads your code. Second, experiment with very simple throw-away scripts. Try different kinds of loops, put a few `print` statements in different places so you can see when, and how often, they print out when you run the script. If you type something that's totally non-working, erase it and do something else. No harm done (assuming you don't tell Perl to delete all your files or something like that). When you get an error, use Perl Monks' Super Search or Google to see what others have done. 95% of the time you'll find somebody has had your problem before and found a solution. (edit: I'm too slow, you've got a detailed reply. I still recommend trying these two exercises often. They'll serve you well.)	[reply] [d/l]
Re^4: Problem computing GC content by zuepinhel (Initiate) on Jul 13, 2014 at 18:19 UTC
Re^5: Problem computing GC content by AnomalousMonk (Archbishop) on Jul 13, 2014 at 20:30 UTC
Re^5: Problem computing GC content by Bethany (Scribe) on Jul 13, 2014 at 18:51 UTC
Re: Problem computing GC content by AnomalousMonk (Archbishop) on Jul 13, 2014 at 20:57 UTC
AppleFritter seems to have given you a good solution to the specific question you asked. However, I would suggest you become familiar with the string manipulation facilities of Perl. Specifically for this problem, the tr operator would seem very useful. Please see its discussion via the tr link and in the Quote-Like Operators section of perlop. `c:\@Work\Perl>perl -wMstrict -le "my $sequence = 'xxxGCxxCxxxxCGxGxxxGxxxxxCxx'; ;; my $sequencelength = length $sequence; print qq{sequence length: $sequencelength}; ;; my $GCcount = $sequence =~ tr/GC//; print qq{total Gs and Cs: $GCcount}; ;; my $GCcontent = ($GCcount / $sequencelength) * 100; print qq{GCcontent: $GCcontent percent} " sequence length: 28 total Gs and Cs: 8 GCcontent: 28.5714285714286 percent` [download]	[reply] [d/l]
Re^2: Problem computing GC content by zuepinhel (Initiate) on Jul 13, 2014 at 21:18 UTC
But it doesn't solve the problem because it continues on computing just one sequence.	[reply]
Re^3: Problem computing GC content by AnomalousMonk (Archbishop) on Jul 13, 2014 at 23:30 UTC
It wasn't intended to solve "the problem". AppleFritter has already offered what seems an adequate solution to the problem — although I must admit that my understanding of the problem may be very defective. What I wanted to offer was an approach to solving a portion of the problem: quickly counting the number of 'G' and 'C' characters in a string. One very Perlish way to do something like this is by using the tr operator.	[reply]
Re^3: Problem computing GC content by ww (Archbishop) on Jul 13, 2014 at 22:51 UTC
Data! SHOW US THE DATA! *Questions containing the words "doesn't work" (or their moral equivalent) will usually get a downvote from me unless accompanied by:* code verbatim error and/or warning messages *a coherent explanation of what "doesn't work* actually means.** check Ln42!	[reply]
Re: Problem computing GC content by Kenosis (Priest) on Jul 14, 2014 at 17:03 UTC
Here's another option: `use strict; use warnings; local $/ = '>'; while (<>) { chomp; /\S/ or next; my ( $id, $seq ) = /(.+?)\n(.+)/s; $seq =~ s/\n//g; my $GCcount = $seq =~ tr/GC//; printf "%s, %.2f%%\n", ">$id", ( $GCcount / length $seq ) * 100; }` [download] Command-line usage: `perl script.pl fastaFile [>outfile]` The last, optional parameter directs output to a file. Sample FASTA record: `>gi\|5524211\|gb\|AAD44166.1\| cytochrome b [Elephas maximus maximus] LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX IENY` [download] Output on that sample FASTA record: `>gi\|5524211\|gb\|AAD44166.1\| cytochrome b [Elephas maximus maximus], 7.3 +9%` [download] Since FASTA files use ">" as the record separator, the script sets Perl's record separator to that character, so the file's read one FASTA record at a time. After a read, `chomp` removes that record separator. Next, `/\s/` tests to see that there are characters to parse and, if not, the next record is read. Using a regex, the id and seq are captured. A substitution is used to remove any newlines, so an accurate length count can be made. (The sequence in FASTA records may be across multiple lines; this script will handle these.) As AnomalousMonk demonstrated, Perl's transliteration operator can be used to get the GC character count within the sequence. Finally, the results are printed using `printf`. Hope this helps!	[reply] [d/l] [select]


Think about Loose Coupling
	PerlMonks