Finding biological palindromes

TJCooper has asked for the wisdom of the Perl Monks concerning the following question:

I've have access to the following script:

#!/usr/bin/perl
$filename = "sample";
open (TEXT, "sample.txt")||die"Cannot";
$line = " ";
$count = 0;
for $n (5..50)
   {
   $re = qr /[CAGT]{$n}/;
   $regexes[$n-5] = $re;
   }
NEXTLINE: while ($count < 1000)
   {
   $line = <TEXT> ;
   $count++;
   foreach my $value (@regexes)
      {
      $start = 0;
      while ($line =~ /$value/g)
         {
         $endline = $';
         $match = $&;
         $revmatch = reverse($match);
         $revmatch =~ tr/CAGT/GTCA/;
         if ($endline =~ /^([CAGT]{0,15})($revmatch)/)
            {
            $start = 1;
            $palindrome = $match . "*" . $1 . "*" . $2;
            $palhash{$palindrome}++;
            }
         }
      if ($start == 0)
         {
         goto NEXTLINE;
         }
      }
   }
open my $out, ">/DIR/results.txt"; 
close TEXT;
while(($key, $value) = each (%palhash))
   {
   print $out "$key => $value\n";
   }
exit;
[download]

Which identifies and outputs identified DNA palindromes. A biological palindrome in DNA is defined as a sequence which when read on both strands in the same direction (5' to 3') is identical:

http://imageshack.com/a/img835/2787/de98.png (as shown by the blue/red regions)

I feel like the above script is rather messy and the output is confusing and unclear. I was wondering if anybody could offer some help, tips, guidance or code itself to accomplish the following:

(1) Identify palindromic DNA sequences

(2) Be able to specify a minimum and maximum length of match

(3) An optional parameter to set whereby only results containing a certain sequence within the length of the palindrome, for example 'GATC', are printed to the output file but where this can also be left blank causing the program to print every single palindrome it finds

(4) The inputs will be DNA sequences of only 1 strand (and not both) - the output needs to be the full palindromic sequence identified for just a single strand - for example in the above photo the input would be:

AGAGGTCAGTCTGCATCGTATCGATCGTCGACGATCGATACGATGCAGACTGACGAGAG

The program would then calculate the other strand's sequence and see if there any palindromes contained within this and if so, output:

GTCAGTCTGCATCGTATCGATCGTCGACGATCGATACGATGCAGACTGAC

Many thanks!

Comment on Finding biological palindromes Download Code

Replies are listed 'Best First'.
Re: Finding biological palindromes by BrowserUk (Patriarch) on May 08, 2014 at 12:47 UTC
Not sure if this is much better than your code: #! perl -slw use strict; my %invert; @invert{ qw[ A C G T ] } = qw[ T G C A ]; my $in = do{ local $/; <DATA> }; chomp $in; print $in; for my $p1 ( 1 .. length( $in ) -2 ) { next unless substr( $in, $p1, 1 ) eq $invert{ substr $in, $p1+1, 1 + }; my $pals = 0; for my $p2 ( 1 .. $p1 -1 ) { last unless substr( $in, $p1-$p2, 1 ) eq $invert{ substr $in, +$p1+$p2+1, 1 }; ++$pals; } if( $pals ) { printf "%s%s at %d\n", ' 'x($p1-$pals), substr( $in, $p1-$pals, ($pals+1)*2 ), $p1-$pals; } } __DATA__ AGAGGTCAGTCTGCATCGTATCGATCGTCGACGATCGATACGATGCAGACTGACGAGAG [download] Produces: `[13:44:54.28] C:\test>1085446 AGAGGTCAGTCTGCATCGTATCGATCGTCGACGATCGATACGATGCAGACTGACGAGAG TGCA at 11 ATCGAT at 19 CGATCG at 21 GTCAGTCTGCATCGTATCGATCGTCGACGATCGATACGATGCAGACTGAC at 4 CGATCG at 31 ATCGAT at 33 TGCA at 43` [download] Modify to taste :) With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l] [select]
Re: Finding biological palindromes by wjw (Priest) on May 08, 2014 at 12:34 UTC
Have you taken a look at what is on CPAN? I know nothing of this, but simply did a search for "CPAN panindrome" and came up with: This and Another(applicable?) If the output is unclear, clarify it. (Is this your code?). I agree with the previous response in getting rid of the 'goto'. Other than that, if it works... . Honestly, unless this is a code writing exercise, I would look for something already written and freely available. Even if it is an exercise in code writing, look at what has been written and use that as an example. Hope that is helpful... ...the majority is always wrong, and always the last to know about it... Insanity: Doing the same thing over and over again and expecting different results... wjw	[reply]
Re: Finding biological palindromes by Laurent_R (Canon) on May 08, 2014 at 12:04 UTC
I don't have time to dive deep into your code right now, but, at least, change this: `if ($start == 0) { goto NEXTLINE; }` [download] to a next statement with a label: `next NEXTLINE if $start == 0;` [download] See also Dijkstra's famous and classical article Go to statement considered harmful: https://files.ifi.uzh.ch/rerg/arvo/courses/kvse/uebungen/Dijkstra_Goto.pdf.	[reply] [d/l] [select]
Re: Finding biological palindromes by igelkott (Priest) on May 08, 2014 at 21:08 UTC
Perhaps you already know this but DNA palindrome is a fairly casual way of saying inverted repeat. Not trying to be picky but the latter term should help with the search. With this in mind, BioPerl has an interface to the relevant function of the popular (and huge) EMBOSS package. Possibly a bit too narrow but you might also want to take a look at a recent paper which uses a Perl script to find a particular type of inverted repeat (MITE).	[reply]
Re^2: Finding biological palindromes by TJCooper (Beadle) on May 12, 2014 at 09:42 UTC
Many apologies for the delay in replying to this thread. As per the advice above i've decided to make use of existing biological tools (EMBOSS) and wrapping this in bioPerl to get a suitable output format rather than constructing something from scratch. I will post the solution when it's completed and also explain it for non-biologists to answer the questions below. Thanks again!	[reply]
Re: Finding biological palindromes by AnomalousMonk (Archbishop) on May 10, 2014 at 08:56 UTC
`open (TEXT, "sample.txt")\|\|die"Cannot"; ... NEXTLINE: while ($count < 1000) { $line = <TEXT> ; $count++; foreach my $value (@regexes) { ... while ($line =~ /$value/g) { ... } ... } } ...` [download] If the file you're reading has fewer than `$count`-limit lines, the loop will continue to operate on an undefined value of `$line` until `$count` reaches its limit. (I tried the code on a file with the single DNA sequence included in the OP with interesting results.) I suggest adding a test after the file read: `$line = <TEXT>;` `last NEXTLINE unless defined $line;` Use of warnings would have allowed Perl to alert you to what was happening. Use of strict is also highly recommended, but would necessitate several further changes to the code.	[reply] [d/l] [select]
Re: Finding biological palindromes by AnomalousMonk (Archbishop) on May 10, 2014 at 23:27 UTC
As a non-bio-monk, there are several questions raised in my mind by your post and by BrowserUk's reply above. With luck, they are trivial questions, and you can easily direct me to answers. BrowserUk's solution produces the 50-base palindrome shown in the OP, but also produces other sequences that seem to be palindromic (again, to this bio-naif) and that are "embedded" within the longer sequence. Are these other sequences significant? Do you want to know about all, some or none of them? In broader terms, consider the sequence `...CGATCG...CGATCG...CGATCG...` where `...` represents zero or more bases that do not participate in any palindromic relationship. `CGATCG` is itself palindromic, each pair `CGATCG...CGATCG` presents another relationship (possibly more than one?), and the entire sequence presents...? Again, which of these seeming possibilities are significant? Which do you really want to know about? I assume you are a biologist or are involved in biological computation, so these questions are probably pretty mundane. I hope it will not be much trouble to set me on the path to enlightenment.	[reply] [d/l] [select]
Re^2: Finding biological palindromes by BrowserUk (Patriarch) on May 11, 2014 at 00:07 UTC
Are these other sequences significant? Do you want to know about all, some or none of them? (2) Be able to specify a minimum and maximum length of match With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply]


The stupid question is the question not asked
	PerlMonks