Re: Comparing strings (exact matches) in LARGE numbers FAST

Are there certain categories of the sequence data? Are we talking ChIP-seq style short sequences, or longer range sequences from PCR products or plasmids (not that it terribly matters, I suppose)? Any way to break the problem into pieces helps. It sounds like you essentially want to do a large-scale BLAST. BioPerl doesn't have anything that will scale THAT well, and the current modules use temp files, so you will have a lot of hard drive access, further slowing down the process.

You could try taking the sequence and translating it into the one-letter code for amino acids. This helps in a couple of ways. It shortens the amount of data that has to be run through by any pattern matching algorithm, and it also decreases the repetition in the sequence. The built-in Perl matching (correct me if I'm wrong, folks) goes one letter at a time, so if the first three letters are ACG, it scans until it finds an A, then looks for a C, and so on. There are a lot of repetitive sequences in DNA, and over time the partial matches build up and take more time to go through. With 20 or so amino acids and less repetition, it would remove some of the time wasted on rabbit trails, so to speak. Potentially this would also free up memory to load more data in at a time.
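A minimal sketch of the translation step described above, in Python and under stated assumptions: it uses the standard genetic code, treats any codon containing a non-ACGT character as `X`, and the names `CODON_TABLE` and `translate` are made up for illustration. Two caveats apply if this is used for exact matching. The result depends on the reading frame, so a read and a target have to be translated in compatible frames. Synonymous codons also collapse to the same letter, so a protein-level hit still needs to be confirmed at the nucleotide level.

```python
from itertools import product

# Standard genetic code, laid out in the conventional TCAG order
# (first base varies slowest, third base fastest). '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(dna, frame=0):
    """Translate a DNA string to one-letter amino acid codes.

    frame is the 0-based offset (0, 1 or 2) of the first codon.
    Trailing bases that do not fill a codon are ignored; codons with
    characters outside ACGT become 'X'.
    """
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(frame, len(dna) - 2, 3)
    )


print(translate("ATGGCC"))            # frame 0: codons ATG, GCC
print(translate("ATGGCCT", frame=1))  # frame 1: codons TGG, CCT
```

The translated strings are a third of the length of the input, which is where the reduction in data comes from.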

What do you mean by biomarkers? Are we talking transcription factor binding sites? There is an object-oriented programming framework called TFBS (Bioinformatics, vol. 18, no. 8, 2002) that is compatible with BioPerl and would likely make this pattern matching routine more efficient, as it is designed for DNA code...

Bioinformatics

Re^2: Comparing strings (exact matches) in LARGE numbers FAST by perlSD (Novice) on Aug 29, 2008 at 16:08 UTC
Hi bioinformatics, that is a very interesting idea. Yes, it's like a BLAST, and I was in fact considering taking the 2nd file, perhaps concatenating those sequences a bit, and making a BLASTable database. Then I would megablast the first file against the 2nd (I have 8 processors), make the word size the length of the strings, and boom.
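Setting the word size to the full string length amounts to exact substring lookup, which can also be sketched without BLAST. The Python example below is a swapped-in technique, not megablast: it puts the reads in hash sets grouped by length and slides a window of each length over every target sequence. The function name `find_exact_matches` and the in-memory dict of targets are assumptions for illustration; real input would be parsed from the two files.

```python
from collections import defaultdict


def find_exact_matches(reads, targets):
    """Find reads that occur verbatim inside target sequences.

    reads   : iterable of read strings (lengths may differ)
    targets : dict mapping a target name to its sequence
    Returns a dict {read: [(target_name, offset), ...]} with 0-based offsets.
    """
    # Group reads by length so each window length is scanned only once.
    reads_by_len = defaultdict(set)
    for read in reads:
        reads_by_len[len(read)].add(read)

    hits = defaultdict(list)
    for name, seq in targets.items():
        for k, group in reads_by_len.items():
            # Slide a window of length k; set membership is O(1) on average.
            for i in range(len(seq) - k + 1):
                window = seq[i:i + k]
                if window in group:
                    hits[window].append((name, i))
    return dict(hits)


reads = ["ACGT", "GGG", "TTTT"]
targets = {"t1": "AACGTGGG", "t2": "ACGTACGT"}
for read, places in find_exact_matches(reads, targets).items():
    print(read, places)
```

The cost grows with the number of distinct read lengths times the total target length, so reads that mostly share one length are the favorable case. The targets can be split across processors, since each one is scanned independently.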
Re^2: Comparing strings (exact matches) in LARGE numbers FAST by perlSD (Novice) on Aug 29, 2008 at 16:18 UTC
No, the sequences don't fall into particular categories. The first file is sequencing output, with reads that could be 25-100 bp. The sequences in the 2nd file could be a lot of things, biologically. We are looking for certain "patterns" or "motifs" in the sequences.
Re^3: Comparing strings (exact matches) in LARGE numbers FAST by bioinformatics (Friar) on Aug 29, 2008 at 22:10 UTC
If you are scanning for motifs, you should be able to adapt TFBS to do that, as a binding site is itself a motif. On another note, STORM is another program designed for such searches, and it can be integrated with a database ("Statistical significance of cis-regulatory modules", BMC Bioinformatics 2007, 8:19). I think that officially puts me out of ideas otherwise :) Bioinformatics

