Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

Re: genetic algorithm for motif finding

by bioinformatics (Friar)
on Aug 13, 2013 at 21:31 UTC ( [id://1049345]=note: print w/replies, xml ) Need Help??


in reply to genetic algorithm for motif finding

The original paper for FMGA (finding motifs by genetic algorithm) is here: link. It gives a decent breakdown of the algorithm and provides pseudocode. What part(s) are you having difficulty coding? I'm not sure that someone is going to have the exact code you want lying around, but they can help you implement things.
EDIT: fixed the link

Bioinformatics
  • Comment on Re: genetic algorithm for motif finding

Replies are listed 'Best First'.
Re^2: genetic algorithm for motif finding
by BrowserUk (Patriarch) on Aug 13, 2013 at 22:14 UTC

    Finding motifs sounds like an interesting problem. Do you know of any descriptions of the problem that

    1. Don't require a brain transplant to understand the genetics terminology.
    2. Don't descend into pointless pseudo-mathematical hieroglyphics.

    Ie. Something that describes:

    • What a motif is?
    • How you know when you've found one?
    • Where you start looking?

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      A motif, also termed a consensus sequence, is a stretch of DNA that has either the exact (rare) or a similar sequence in many places across the genome; you can think of it like input to a fuzzy search, close but not exact, maybe a misspelling or two. These sites often serve as places where proteins physically interact and bind along the DNA. There are data collections, based on sequencing data, that allow one to know where along the DNA that a given protein binds, and from which one may look for enriched or common motifs by looking at the sequence from the binding coordinates. It may be that these motifs are associated with the protein in question, or they may be motifs for other proteins which interact with the protein you have data for. Collections of motifs in a region (such as the promoter region, where many proteins bind to turn a gene on, off, increased, or decreased--think dimmer switch) can be refered to as cis regulatory regions. You know you found one when you can see an enrichment or increased frequency over some background (control) sequence. Common programs used in this analysis are MEME and nestedMICA. Does this help a bit?

      Bioinformatics
        Does this help a bit?

        Some. Though it is still couched in a lot of terms that aren't immediately descriptive to the layman.

        Is this close to a paraphrase of the problem?

        A motif is a subsequence, of initially unknown length, that repeats, with minor variations, several times within a localised region of a (gene) sequence.

        The problem of finding them is that of recognising that there are several near repetitions of a subsequence within a (relatively) short stretch (100s or low thousands) of 'letters'.

        If that is close, then a few questions arise:

        • Is the 'source material' for the search coded in terms of just {acgt}?

          Or are the encoded in that other form where 1 letter is used to specif: this position might be any of 'a' or 'c'; or this position might be any of 'a' or 'g' or 't'?

        • Is there a minimum length to a motif?
        • Will the repetitions be the same length?
        • Will the seeker normally have a rough location from (or around) which to start looking?

        Maybe I've nothing to contribute to the problem; but I was playing with a novel indexing algorithm a couple of years that might lend itself to this problem. My problem is getting a clear understanding of the problem in terms I can relate to without having to go off and become conversant in the genomic terminology.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
      You can learn a bit of bioinformatics at Rosalind. By solving problems, you not only gain XP and contest, but also learn about the underlying science.
      لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

        I took a look at 2 or 3 of the problems in the "bioinformatics stronghold" section -- I picked those that seemed to have the lowest number of solutions assuming they would be the hardest problems -- and for those I tried the solutions were trivial. Usually simple one-liners.

        I see little incentive to (re-)solving such simple problems; nor did I really learn that much about the terminology from doing them. In the end, I'm never going to be a bioinformatician -- no real access to real world data or problems -- so my becoming conversant in the terminology doesn't really benefit anyone.

        My interest is solely the computational and algorithmic challenges that the field presents, and for that I just need the problems described in terms I understand; and access to the (real or example) datasets.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        By /msg you said:

        Re Re^4: genetic algorithm for motif finding The problems start with the simple ones, you have to solve them to get further. I found for example EDTA or RNAS to be a bit more complicated.

        I have no idea if you have anything to do with the Rosalind site; but if you do, please do not be offended by this. It's just my opinion :) Your attempt to help me is much appreciated.

        The problem I see with the Rosalind site is this: The tutorials and challenges are all geared to leading the programmer to solve the problems in one particular way. In the case of the two examples you cite are Edit Distance & graphs respectively. Both of which are a problem.

        • Edit distance.

          There are many different algorithms for this Levestien; Wagner-Fischer; etc. and they are all horribly inefficient O(mn) compared to simply xoring teh strings and counting the nulls:

          $n = ($a ^ $b ) =~tr[\0][];

          Which is O(2N). And as both N are implemented in C (opcodes); they results can be orders of magnitude faster for DNA length strings.

        • Graphs.

          Unless a new graphs module has appeared on the scene (cpan) recently; implementing graphing algorithms in Perl is horribly slow and hugely memory hungry.

          Graphs lend themselves to being implemented using OO-style with a Nodes/Edges/AdjacencyMatrix/Attributes/Weights/etc classes all based around blessed Hashes (or worse!) and rendering the simplest of graphs constructed with them huge, cumbersome and slow.

          The idea of solving genomic problems by creating graphs of entire genes with every base a node and edges linking the pairs is just a non-starter in Perl using native Perl graphing libraries. They just require too much memory and processor.

          That is probably why so much of genomic work is parceled up and sent of to mainframes or clusters (BLAST servers and the like) with gobs of memory and huge processing power, to do the donkey work. But that in turn creates its own set of knock on problems:

          1. These batch processors tend to produce far more information than most querants require. (say) Producing edit distances and other quality statistics; when often only a boolean yes or no is required.
          2. Their output is often-as-not in a form -- multi-line 'carded' records and the like -- which make extracting the required results almost as complex as solving the original problem.

        The silly thing is, that X86-64 class machines are actually very good at string processing; and many of the tasks can be tackled very efficiently using them; once you stop viewing the problems in terms of graph theory and look for string manipulation solutions.

        That's why when these DNA/RNA problems come up; I ask for and try to extract very simple descriptions of the problems that aren't couched in either biogenetic terminology nor the mathematical symbolism applicable to just one approach to solving the problem. Unfortunately, these requests usually fall on deaf ears.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        You can learn a bit of bioinformatics at Rosalind. By solving problems, you not only gain XP and contest, but also learn about the underlying science.

        You can learn half jargon here

        Seriously, the problem with bioinformatics they want to turn every programmer into a bioinformatician -- its like wanting to turn every electricial into an electrical engineer

      Although the OP is interested in DNA, the term is also used for amino acids/proteins.

      The Wikipedia entry for Sequence motif is pretty clear, I think.

      Also enlightening may be: Sequence logo.

        The Wikipedia entry for Sequence motif is pretty clear, I think.

        I guess I must be thick then, because nothing in the first paragraph nor the overview told anything approaching: A subsequence of 4 to 20 characters that is repeated (with minor differences possible) within 'close proximity'; which is the only definition that is useful to a programmer.

        Also enlightening may be: Sequence logo.

        As it is you, I'm certain there is some relevance here; but I ain't seeing it :)


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
    A reply falls below the community's threshold of quality. You may see it by logging in.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1049345]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others goofing around in the Monastery: (6)
As of 2024-04-19 08:16 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found