Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

Re^2: Search for identical substrings

by bioMan (Beadle)
on Aug 18, 2005 at 15:52 UTC ( [id://484832]=note: print w/replies, xml ) Need Help??


in reply to Re: Search for identical substrings
in thread Search for identical substrings

The data I am working with are DNA sequences. (I know of the Bioperl project but have found nothing there that could help.) I have distilled the data to the DNA sequences alone. I have a second file with the sequence IDs, which is why I also print out the indicies from the matching strings. The data look like this.

ATGGAGAACATCACATCAGGACTCCTAGGACCCCTTCTCGTGTTACAGGC
ATGGAGAACATCACATCACGACTCCTAGGACCCCTTCACGTGAAACAGGC
ATGCTCAACGTCACATCAGGACTCCTAGGACCACGTCTCGTGTTACAGGG
ATGGTGTACATCACGACAGGATTCCTCGGAATCGCGCTGGTGACACAGGC

With the sequence IDs the data would look like this.

>seq1
ATGGAGAACATCACATCAGGACTCCTAGGACCCCTTCTCGTGTTACAGGC
>seq2
ATGGAGAACATCACATCACGACTCCTAGGACCCCTTCACGTGAAACAGGC
>seq3
ATGCTCAACGTCACATCAGGACTCCTAGGACCACGTCTCGTGTTACAGGG
>seq4
ATGGTGTACATCACGACAGGATTCCTCGGAATCGCGCTGGTGACACAGGC

### update ###

I have placed real data in my public scratchpad.

Replies are listed 'Best First'.
Re^3: Search for identical substrings
by graff (Chancellor) on Aug 18, 2005 at 18:54 UTC
    So, just to see if I understand the task... given those four (truncated?) lines of input data, would the following be the "right" answer?
    LCS for 0 :: 1 = |ATGGAGAACATCACATCA| LCS for 0 :: 2 = |TCACATCAGGACTCCTAGGACC| LCS for 0 :: 3 = |CATCAC| LCS for 1 :: 2 = |ACTCCTAGGACC| LCS for 1 :: 3 = |CATCAC| LCS for 2 :: 3 = |CAGGA|
    This doesn't keep track of the actual index offsets where the longest match actually starts in each string for each pairwise comparison, but that would be easy to add.

    That's the output from the code posted in my later reply in this thread, given those four lines of sample data as input.

      Unfortunately my data doesn't give simple answers, and that is my fault completely for not being more careful with how I constructed the test data. My results are:

      0 :: 1 ATGGAGAACATCACATCA and GACTCCTAGGACCCCTTC 0 :: 2 TCACATCAGGACTCCTAGGACC 0 :: 3 ACATCAC 1 :: 2 GACTCCTAGGACC 1 :: 3 ACATCAC 2 :: 3 CAGGA and ACAGG

      Unfortunately, the answers are not simple and that's my fault for not being more careful with the test data I created. The correct answers are:

      0 :: 1 ATGGAGAACATCACATCA and GACTCCTAGGACCCCTTC 0 :: 2 TCACATCAGGACTCCTAGGACC 0 :: 3 ACATCAC 1 :: 2 GACTCCTAGGACC 1 :: 3 ACATCAC 2 :: 3 CAGGA and ACAGG
Re^3: Search for identical substrings
by BrowserUk (Patriarch) on Aug 18, 2005 at 20:58 UTC

    I get these results:

    484593-3 Best match len:22 betwixt 0:10 & 2:10 ATGGAGAACATCACATCAGGACTCCTAGGACCCCTTCTCGTGTTACAGGC TCACATCAGGACTCCTAGGACC ATGCTCAACGTCACATCAGGACTCCTAGGACCACGTCTCGTGTTACAGGG TCACATCAGGACTCCTAGGACC 6 trials of N:4 L:50 MIN:2 ( 4.098ms total), 682us/trial

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://484832]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others sharing their wisdom with the Monastery: (6)
As of 2024-04-18 03:15 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found