Here's something that uses some of the best ideas of wind (proceed by decreasing length) and AnomalousMonk (try using index as a first pass at matching).

#!/usr/bin/env perl

use strict;
use warnings;

my @strings = qw(AGCT AGGT GG AGCT);

my %uniques;
$uniques{shift @strings}++ while @strings;

# Bucket the unique strings by length, freeing hash entries as we go.
my @slots;
while (my $i = each %uniques) {
    push @{$slots[length $i]}, $i;
    delete $uniques{$i};
}

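# Start with the longest strings, then work down by length. The ':'
# separator (assumed absent from the data) stops a match from spanning
# two reference strings.
my $master = join (':', @{pop @slots});
while (@slots) {
    my @nomatch = grep {index ($master, $_) < 0} @{pop @slots or []};
    $master .= ':' . join (':', @nomatch) if @nomatch;
}
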
# answer: one surviving string per line
$master =~ s/:/\n/g;
print "$master\n";

BrowserUK's questions are excellent. They affected my comment by reminding me to set expectations: my suggested solution requires a machine with enough memory to hold the entire data set. However, I've tried to keep the memory usage not much above that.

If I were more of a regexp whiz, I'd try to come up with some way of cutting down the match attempts that run up to a ':' boundary, but I'm not. Besides, I'm a recovering FORTRAN programmer, so the index function is the programming equivalent of comfort food for me.

I used a slightly more robust test set than what is listed here, but I emphasize "slightly". YMMV on real data with more corner cases....
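For anyone who wants to poke at corner cases, here is a minimal sketch of the same approach wrapped in a sub so it can be fed different lists. The name maximal_strings is my own invention for illustration, and the sketch assumes that no input string contains ':' and that empty strings are not of interest.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): same idea as the
# script above, longest strings first, index against a ':'-joined master.
# Assumes no input string contains ':'.
sub maximal_strings {
    my @input = @_;

    # Drop exact duplicates, then bucket the rest by length.
    my %seen;
    my @slots;
    for my $s (grep { !$seen{$_}++ } @input) {
        push @{ $slots[length $s] }, $s;
    }

    my @kept;
    my $master = '';
    while (@slots) {
        my $bucket = pop @slots or next;    # skip lengths nobody has
        my @new = grep { index($master, $_) < 0 } @$bucket;
        next unless @new;
        push @kept, @new;
        $master .= ':' . join(':', @new);
    }
    return @kept;
}

# Duplicate plus a substring: GG is inside AGGT.
print join(' ', maximal_strings(qw(AGCT AGGT GG AGCT))), "\n";   # AGCT AGGT

# The separator matters: BB is not in AAB or BCC, only in "AABBCC".
print join(' ', maximal_strings(qw(AAB BCC BB))), "\n";          # AAB BCC BB

# A gap in the lengths (nothing between 1 and 8 characters).
print join(' ', maximal_strings(qw(A ACGTACGT))), "\n";          # ACGTACGT
```

The second case is the one I would watch most closely on real data: joining without a separator would wrongly swallow BB.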

Edit: "across" replaced by "up to" in the penultimate paragraph above. If you want to match a 399 character string within a 400 character string, you only need to check matches starting at the first two characters of the 400 character string, but concatenating all the reference strings into one master string hides those boundaries from index, so it cannot skip the hopeless starting positions. The hope is that one index call on a long string is faster than N index calls on smaller strings (but I'm too lazy to check this today :-). It is very tempting to try to compile the master string into a savvy regexp (with the "o" flag) anew with each iteration of the last while loop, and I'd be interested in seeing any such solution.
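One way to check the "one index versus N index calls" hunch is the core Benchmark module. This is only a sketch under made-up conditions: the data is random (1000 reference strings of 40 characters, 100 probes of 20 characters), and the sub names in_master and in_list are invented here. Timings on real data may look quite different.

```perl
#!/usr/bin/env perl

use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data: random DNA-like strings, not real input.
my @bases = qw(A C G T);
my @refs  = map { join '', map { $bases[rand @bases] } 1 .. 40 } 1 .. 1000;
my $master = join ':', @refs;

# Half the probes are cut out of a reference string (guaranteed hits),
# half are random (almost certainly misses).
my @probes = map { substr($refs[rand @refs], 5, 20) } 1 .. 50;
push @probes, map { join '', map { $bases[rand @bases] } 1 .. 20 } 1 .. 50;

# One index call against the ':'-joined master string.
sub in_master {
    my ($s) = @_;
    return index($master, $s) >= 0;
}

# Up to N index calls, one per reference string, stopping at the first hit.
sub in_list {
    my ($s) = @_;
    for my $r (@refs) {
        return 1 if index($r, $s) >= 0;
    }
    return 0;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    one_index => sub { my $n = grep { in_master($_) } @probes },
    n_index   => sub { my $n = grep { in_list($_) } @probes },
});
```

Both subs should give the same answer for any probe that contains no ':', so the table compares speed only.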

In reply to Re: list of unique strings, also eliminating matching substrings by jaredor
in thread list of unique strings, also eliminating matching substrings by lindsay_grey
