Re: list of unique strings, also eliminating matching substrings

I guess I'd start by sorting the strings by length - largest first (hopefully the order of unique strings won't matter - if they do, then it gets moderately, but not immensely, more complex):

@sequences = sort { length($b) <=> length($a) } @sequences
[download]

Once we have that, then I would go through and find the unique ones. Now, given that we are going from largest to smallest, the item we're about to put in the array must either match or be a subset of an existing item. It cannot be the other way around: that an item in the output list is a subset of the item we're looking at. Thus we only have to check against the list that we have so far:


use List::MoreUtils qw(any);

my @uniq_sequences;
for my $seq (@sequences)
{
  push @uniq_sequences, $seq if
    any { index $_, $seq >= 0 } @uniq_sequences;
}
[download]

Now, since we're only looping over the large array once, and the small array many times, this may be Fast Enough. In fact, we're not looping over everything in the small array - unless there is no match found (or the item we're looking for is the last one).

There are definitely items I can think of benchmarking to see if there are speedups to be found. One possibility would be to go from small to large (reverse order) and see if a regex-optimiser could smoosh the words you're looking for into an optimised search. While I doubt this would actually speed anything up (the optimisation might dwarf the loop in my original solution above), only a benchmark would be sure.

Another possibility would be to compile your sequences down to byte sequences. Since this looks like DNA, thus only four letters, each position could basically be a 2 bits of data: A could be 0b00, G could be 0b01, C could be 0b10, and T could be 0b11. With some serious bit-manipulating math, where you have to shift stuff around for comparisons, you may be able to get more speed. Again, a benchmark would need to be done to be sure. This one likely holds some promise, but at the expense of some serious development time. My guess? It's not worth it. For the amount of money your employer would be paying you for that amount of time, it's probably cheaper to buy a faster CPU and/or more RAM, to run the original algorithm. And then you'll have a faster computer, too :-) (Of course, if someone gives you an already-tested solution to your issue using this algorithm, especially if it comes with unit tests, then TAKE IT.)

Comment on Re: list of unique strings, also eliminating matching substrings Select or Download Code


Keep It Simple, Stupid
	PerlMonks