PerlMonks  

Re: Comparing strings (exact matches) in LARGE numbers FAST

by bioinformatics (Friar)
on Aug 29, 2008 at 06:29 UTC ( [id://707673]=note )


in reply to Comparing strings (exact matches) in LARGE numbers FAST

Are there certain categories of sequence data? Are we talking ChIP-seq style short sequences, or longer sequences from PCR products or plasmids (not that it matters terribly, I suppose)? Any way to break the problem into pieces helps. It sounds like you essentially want a large-scale BLAST program. BioPerl doesn't have anything that will scale THAT well, and the current modules use temp files, so you will have a lot of hard drive access, slowing the process down further.

You could try translating the sequence into the one-letter code for amino acids. This helps in a couple of ways. It cuts the amount of data that any pattern matching algorithm has to run through to about a third (three bases per amino acid), and it also reduces the repetition in the sequence. The built-in Perl matching (correct me if I'm wrong, folks) works one letter at a time, so if the first three letters are ACG, it scans until it finds an A, then looks for a C, and so on. DNA contains a lot of repetitive sequence, so these false starts add up and cost time. With 20 or so amino acids and less repetition, you would waste less time on rabbit trails, so to speak. Potentially this would also free up memory to load more data at a time.
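As a rough illustration of the translation idea, here is a minimal sketch in Python rather than Perl. It assumes the standard genetic code and forward-strand matching only, and the names `translate` and `find_exact` are made up for this example. Two caveats shape it: a DNA match can start in any of three reading frames, so the target has to be translated in all three, and translation is lossy (synonymous codons collapse to the same letter), so every amino-acid hit is only a candidate that still has to be checked against the DNA.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame=0):
    """Translate DNA to one-letter amino acid codes, starting at offset `frame`.

    Codons containing anything other than A/C/G/T (e.g. N) become 'X';
    trailing bases that do not fill a codon are dropped.
    """
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(frame, len(dna) - 2, 3)
    )


def find_exact(query, target):
    """Return start positions of exact DNA matches of `query` in `target`.

    Candidates are found in amino-acid space (a third of the length), then
    each one is verified at the DNA level because translation is lossy.
    """
    query, target = query.upper(), target.upper()
    q_prot = translate(query)
    hits = []
    for frame in range(3):
        t_prot = translate(target, frame)
        start = t_prot.find(q_prot)
        while start != -1:
            pos = frame + 3 * start          # map back to a DNA coordinate
            if target.startswith(query, pos):
                hits.append(pos)
            start = t_prot.find(q_prot, start + 1)
    return sorted(hits)


print(translate("ATGGCCAAGTAA"))                      # MAK*
print(find_exact("ATGGCCAAG", "GGATGGCCAAGTAACC"))    # [2]
print(find_exact("ATGGCCAAG", "ATGGCTAAG"))           # [] (synonymous codon, not an exact match)
```

Whether this actually beats a plain search on the raw DNA depends on the data and on the matcher used, so it is worth timing both on a sample before committing to it.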

What do you mean by biomarkers? Are we talking about transcription factor binding sites? There is an object-oriented programming framework called TFBS (Bioinformatics, vol. 18, no. 8, 2002) that is compatible with BioPerl and would likely make this pattern matching routine more efficient, as it is designed for DNA code...
Bioinformatics

Replies are listed 'Best First'.
Re^2: Comparing strings (exact matches) in LARGE numbers FAST
by perlSD (Novice) on Aug 29, 2008 at 16:08 UTC
    Hi bioinformatics, that is a very interesting idea. Yes, it's like a BLAST, and I was in fact considering taking the 2nd file, perhaps concatenating those sequences a bit, and making a BLASTable database. Then I would megablast the first file against the 2nd on my 8 processors, with the word size set to the length of the strings, and boom.
Re^2: Comparing strings (exact matches) in LARGE numbers FAST
by perlSD (Novice) on Aug 29, 2008 at 16:18 UTC
    No, the sequences don't fall into particular categories. The first file is sequencing output, with reads of 25-100 bp. The sequences in the 2nd file could be a lot of things, biologically. We are looking for certain "patterns" or "motifs" in the sequences.
      If you are scanning for motifs, you should be able to adapt TFBS to do that, as a binding site is itself a motif. On another note, STORM is another program designed for such searches, and it can be integrated with a database ("Statistical significance of cis-regulatory modules", BMC Bioinformatics 2007, 8:19). I think that officially puts me out of ideas otherwise :)

      Bioinformatics
