Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked
 
PerlMonks  

Finding biological palindromes

by TJCooper (Beadle)
on May 08, 2014 at 11:44 UTC ( [id://1085446]=perlquestion: print w/replies, xml ) Need Help??

TJCooper has asked for the wisdom of the Perl Monks concerning the following question:

I've have access to the following script:

#!/usr/bin/perl $filename = "sample"; open (TEXT, "sample.txt")||die"Cannot"; $line = " "; $count = 0; for $n (5..50) { $re = qr /[CAGT]{$n}/; $regexes[$n-5] = $re; } NEXTLINE: while ($count < 1000) { $line = <TEXT> ; $count++; foreach my $value (@regexes) { $start = 0; while ($line =~ /$value/g) { $endline = $'; $match = $&; $revmatch = reverse($match); $revmatch =~ tr/CAGT/GTCA/; if ($endline =~ /^([CAGT]{0,15})($revmatch)/) { $start = 1; $palindrome = $match . "*" . $1 . "*" . $2; $palhash{$palindrome}++; } } if ($start == 0) { goto NEXTLINE; } } } open my $out, ">/DIR/results.txt"; close TEXT; while(($key, $value) = each (%palhash)) { print $out "$key => $value\n"; } exit;

Which identifies and outputs identified DNA palindromes. A biological palindrome in DNA is defined as a sequence which when read on both strands in the same direction (5' to 3') is identical:

http://imageshack.com/a/img835/2787/de98.png (as shown by the blue/red regions)

I feel like the above script is rather messy and the output is confusing and unclear. I was wondering if anybody could offer some help, tips, guidance or code itself to accomplish the following:

(1) Identify palindromic DNA sequences

(2) Be able to specify a minimum and maximum length of match

(3) An optional parameter to set whereby only results containing a certain sequence within the length of the palindrome, for example 'GATC', are printed to the output file but where this can also be left blank causing the program to print every single palindrome it finds

(4) The inputs will be DNA sequences of only 1 strand (and not both) - the output needs to be the full palindromic sequence identified for just a single strand - for example in the above photo the input would be:

AGAGGTCAGTCTGCATCGTATCGATCGTCGACGATCGATACGATGCAGACTGACGAGAG

The program would then calculate the other strand's sequence and see if there any palindromes contained within this and if so, output:

GTCAGTCTGCATCGTATCGATCGTCGACGATCGATACGATGCAGACTGAC

Many thanks!

Replies are listed 'Best First'.
Re: Finding biological palindromes
by BrowserUk (Patriarch) on May 08, 2014 at 12:47 UTC

    Not sure if this is much better than your code:

    #! perl -slw use strict; my %invert; @invert{ qw[ A C G T ] } = qw[ T G C A ]; my $in = do{ local $/; <DATA> }; chomp $in; print $in; for my $p1 ( 1 .. length( $in ) -2 ) { next unless substr( $in, $p1, 1 ) eq $invert{ substr $in, $p1+1, 1 + }; my $pals = 0; for my $p2 ( 1 .. $p1 -1 ) { last unless substr( $in, $p1-$p2, 1 ) eq $invert{ substr $in, +$p1+$p2+1, 1 }; ++$pals; } if( $pals ) { printf "%s%s at %d\n", ' 'x($p1-$pals), substr( $in, $p1-$pals, ($pals+1)*2 ), $p1-$pals; } } __DATA__ AGAGGTCAGTCTGCATCGTATCGATCGTCGACGATCGATACGATGCAGACTGACGAGAG

    Produces:

    [13:44:54.28] C:\test>1085446 AGAGGTCAGTCTGCATCGTATCGATCGTCGACGATCGATACGATGCAGACTGACGAGAG TGCA at 11 ATCGAT at 19 CGATCG at 21 GTCAGTCTGCATCGTATCGATCGTCGACGATCGATACGATGCAGACTGAC at 4 CGATCG at 31 ATCGAT at 33 TGCA at 43

    Modify to taste :)


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Finding biological palindromes
by wjw (Priest) on May 08, 2014 at 12:34 UTC
    Have you taken a look at what is on CPAN?

    I know nothing of this, but simply did a search for "CPAN panindrome" and came up with:

    If the output is unclear, clarify it. (Is this your code?). I agree with the previous response in getting rid of the 'goto'.
    Other than that, if it works... .

    Honestly, unless this is a code writing exercise, I would look for something already written and freely available. Even if it is an
    exercise in code writing, look at what has been written and use that as an example.

    Hope that is helpful...

    ...the majority is always wrong, and always the last to know about it...
    Insanity: Doing the same thing over and over again and expecting different results...
Re: Finding biological palindromes
by Laurent_R (Canon) on May 08, 2014 at 12:04 UTC
Re: Finding biological palindromes
by igelkott (Priest) on May 08, 2014 at 21:08 UTC

    Perhaps you already know this but DNA palindrome is a fairly casual way of saying inverted repeat. Not trying to be picky but the latter term should help with the search.

    With this in mind, BioPerl has an interface to the relevant function of the popular (and huge) EMBOSS package.

    Possibly a bit too narrow but you might also want to take a look at a recent paper which uses a Perl script to find a particular type of inverted repeat (MITE).

      Many apologies for the delay in replying to this thread. As per the advice above i've decided to make use of existing biological tools (EMBOSS) and wrapping this in bioPerl to get a suitable output format rather than constructing something from scratch. I will post the solution when it's completed and also explain it for non-biologists to answer the questions below. Thanks again!
Re: Finding biological palindromes
by AnomalousMonk (Archbishop) on May 10, 2014 at 08:56 UTC
    open (TEXT, "sample.txt")||die"Cannot"; ... NEXTLINE: while ($count < 1000) { $line = <TEXT> ; $count++; foreach my $value (@regexes) { ... while ($line =~ /$value/g) { ... } ... } } ...

    If the file you're reading has fewer than $count-limit lines, the loop will continue to operate on an undefined value of  $line until  $count reaches its limit. (I tried the code on a file with the single DNA sequence included in the OP with interesting results.) I suggest adding a test after the file read:
        $line = <TEXT>;
        last NEXTLINE unless defined $line;

    Use of warnings would have allowed Perl to alert you to what was happening. Use of strict is also highly recommended, but would necessitate several further changes to the code.

Re: Finding biological palindromes
by AnomalousMonk (Archbishop) on May 10, 2014 at 23:27 UTC

    As a non-bio-monk, there are several questions raised in my mind by your post and by BrowserUk's reply above. With luck, they are trivial questions, and you can easily direct me to answers.

    BrowserUk's solution produces the 50-base palindrome shown in the OP, but also produces other sequences that seem to be palindromic (again, to this bio-naif) and that are "embedded" within the longer sequence. Are these other sequences significant? Do you want to know about all, some or none of them?

    In broader terms, consider the sequence
        ...CGATCG...CGATCG...CGATCG...
    where ... represents zero or more bases that do not participate in any palindromic relationship. CGATCG is itself palindromic, each pair CGATCG...CGATCG presents another relationship (possibly more than one?), and the entire sequence presents...? Again, which of these seeming possibilities are significant? Which do you really want to know about?

    I assume you are a biologist or are involved in biological computation, so these questions are probably pretty mundane. I hope it will not be much trouble to set me on the path to enlightenment.

      Are these other sequences significant? Do you want to know about all, some or none of them?
      (2) Be able to specify a minimum and maximum length of match

      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1085446]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others meditating upon the Monastery: (7)
As of 2024-04-25 08:29 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found