Re: Random shuffling

Replies are listed 'Best First'.
Re^2: Random shuffling by BrowserUk (Patriarch) on Jun 21, 2015 at 12:10 UTC
Maybe it is worth pointing out that algorithms like Levenshtein & Longest Common Subsequence; and modules that implement them (like String::Approx) are entirely useless to genomists. They don't care about the "score"; only, is it there or not. And doing all the O(N²) or worse, work that those modules do to calculate their scores in order to derive a simple boolean answer is just way too costly. When they usually are doing 10s of thousands of (say) 50/4 matches against thousands of sequences, each containing millions of codons; using those types of fuzzy-match algorithms means runtimes measured in months or years. There are much faster ways of doing fuzzy matching when all you need is a yes/no answer. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!	[reply]
Re^3: Random shuffling by Laurent_R (Canon) on Jun 21, 2015 at 14:10 UTC
Maybe it is worth pointing out that algorithms like Levenshtein & Longest Common Subsequence; and modules that implement them (like String::Approx) are entirely useless to genomists. (...) There are much faster ways of doing fuzzy matching when all you need is a yes/no answer. Thanks for the information. I frankly have a very very limited knowledge of what geneticists are doing with their DNA and other molecular sequences. I suspected that these algorithms might be slow for what geneticists are doing (which is why I recommended to look at BioPerl) but I did not know whether there were faster ways to accomplish their tasks.	[reply]
Re^4: Random shuffling by BrowserUk (Patriarch) on Jun 21, 2015 at 15:16 UTC
but I did not know whether there were faster ways to accomplish their tasks. A slightly modified standard string compare, that counts the number of mismatched characters so far and doesn't short-circuit until that number has been exceeded, is O(N) worst case at any given match site; compared to O(N²) best case at any given match site, for any of the distance algorithms. For a 50/4 in a million that makes it 50 million byte compares worst case and 4 million byte compares best case; versus 2.5 billion bytes compares for every case with Levenshtein. The bioinformatics crowd tend to use an indexing method. (Look-up BLAST-N, BLAST-X etc.) If you're looking for 50/4, then there must be at least one exact match of 9 bytes at any successful match site: 9+1+9+1+9+1+9+1+9. So, if you index all the 9 byte sequences in the haystack; and all the 9 byte sequences in the needle; then lookup all the latter in the former; you can skip huge chunks of the haystack where a match simply could not exist. This is especially effective when searching each haystack for hundreds or thousands of needles, as the indexing of the haystack gets amortised. Further, there are on-line front-ends to mainframes/server farms that have many of the commonly searched sequences (Humans, fruit flys; corn; potatoes etc.), pre-indexed, thus further amortising those indexing costs across the searches of hundreds or thousands of individual researchers queries. There are downsides to BLAST-x; namely, there is typically a fixed size to the exact-match component, often 7 or more; which means there are limits to the ratio of needle size/permitted mismatches that can be searched for. With a minimum of 7; 47+3= 31/3 or 57+4= 39/4; 6*7+5 = 49/5 etc. Thus, if the mutation site that is being searched for is either shorter than looked for; or contains 1 or more mismatches than is being looked for; some potential sites will simply never be inspected. What you might call: a lossy, fuzzy match. It's taken me 7 or 8 years to clarify some of those details; and I might still have some of them wrong; but the basic premise is that, given the volumes of data they have to process, very fast, with the possibility of the occasional miss; is far, far preferable to absolute accuracy, if it incurs a substantial slowdown. Distance algorithms are very, very costly. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!	[reply]
Re^5: Random shuffling by Laurent_R (Canon) on Jun 21, 2015 at 15:32 UTC
Re^4: Random shuffling by onlyIDleft (Scribe) on Jun 22, 2015 at 00:19 UTC
Hi Laurent and BrowserUK - thank you for your replies. I am not trying to match the sequences to the query myself, the software I am using does it and returns the score. And BrowserUK's claim that genomicists have no use of score is not true to say the least. While it may be true that for some tasks, it might become necessary from a practical point of view, to impose a threshold score, it doesn't mean scores do not matter. Quite the contrary.... prime case in point - the BLAST tool. OK, I am going on a tangent, so back to the question at hand: BrowserUK, I am sure understand your point about randomness being randomness, so my bone of contention or argument is NOT about how many numbers of shuffles, because it is now clear to me that doing it multiple times is not better, but a waste of time, no argument there I am getting more numbers of matches in the shuffled DNA sequences than in the intact DNA sequences! This is an absurd result, which can only mean that there is so much noise being picked up by the software and reported as a match from intact sequences. When the sequence is shuffled, there is more noise that is being generated due to the shuffle per se, and since the software does not do a good job of discrimination, it reports a higher number of matches! This is the only conclusion I can arrive at based on my discussion with you folks. DNA random shuffling that I am doing is not a problem, the approach the software used in separating signal from noise is not efficient enough, is what I am thinking.... Unless any of you folks have a different opinion about my conclusion regarding the software I am using, and why it is reported more number of matches despite DNA sequence random shuffling, I consider this matter closed. Thanks to all who participated in this discussion, I am grateful for your inputs, suggestions, thoughts and code. Cheers!	[reply]
Re^5: Random shuffling by BrowserUk (Patriarch) on Jun 22, 2015 at 02:14 UTC


We don't bite newbies here... much
	PerlMonks