Re: list of unique strings, also eliminating matching substrings

lindsay_grey:

I've been plunking around for about 4 hours on this one (it's an interesting problem!). I first built a test data generator to generate some datasets.

#!/usr/bin/perl
#
#       gen_random_string.pl <CNT> <SYMS> <minlen> <maxlen>
#
#       Generate <CNT> random strings between <minlen> and <maxlen> characters
#       using only the characters in <SYMS>.
#
use strict;
use warnings;

my $cnt = shift;
my $SYMS = shift;
my $min = shift;
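# Note: usage() was never defined in the original, so die with the usage text directly.
my $max = shift or die "Missing args!\nUsage: gen_random_string.pl <CNT> <SYMS> <minlen> <maxlen>\n";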

die "CNT ('$cnt') must be numeric!\n" if $cnt =~ /[^0-9]/;
die "MIN ('$min') must be numeric!\n" if $min =~ /[^0-9]/;
die "MAX ('$max') must be numeric!\n" if $max =~ /[^0-9]/;
die "SYMS must be longer than 0 characters!\n" if length($SYMS) < 1;
if ($min>$max) {
        print "Min must be <= Max, swapping!\n";
        my $t=$min; $min=$max; $max=$t;
}

my @syms = split //,$SYMS;
#print join(", ", @syms),"\n";

while ($cnt) {
        my $len=$min + int(($max-$min+1)*rand());   # +1 so that <maxlen> itself can occur
        my $t='';
        $t .= $syms[int(rand(@syms))] for 1 .. $len;
        print "$t\n";
        --$cnt;
}

My primary datasets are 100, 200, 500, 1000, 2000, 5000 and 10000 strings each, where the strings are between 15 and 25 characters long. I generated them like:

$ for J in {1,2,5}0{0,00,000}; do echo $J; perl gen_random_string.pl $J ACGTN 15 25 >t.$J; done

I next created a trivial brute-force solver:

#!/usr/bin/perl
#
#       multi-string-match_brute_force.pl <FName>
#
use strict;
use warnings;
use feature ':5.10';

my $fname = shift;
open my $FH, '<', $fname or die;
my @candidates = <$FH>;

@candidates =
        grep { /^[ACGTN]+$/ } # delete the comments
        map { s/^\s+//; s/\s+$//; $_ }
        @candidates;

my $start = time;
@candidates = sort { length($a) <=> length($b) || $a cmp $b } @candidates;
my @unique;
my $cnt_dup=0;
OUTER:
while (my $t = shift @candidates) {
        for my $u (@unique) {
                if ($t =~ /$u/) {
                        ++$cnt_dup;
                        next OUTER;
                }
        }
        push @unique, $t;
}
my $end = time - $start;

print scalar(@unique), " unique items.\n";
print "$cnt_dup rejected.\n";
print "$end seconds\n";
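As a small cross-check of what the brute-force pass does, here is a minimal sketch run on a hand-made list. It uses index() rather than a regex match, which should give the same answer for plain ACGTN strings. The sub name filter_containing and the sample data are made up for illustration and are not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the script above): keep a string only if it
# does not contain any previously kept string. Shortest strings are tried
# first, so a longer string that contains a kept shorter one is rejected.
sub filter_containing {
    my @sorted = sort { length($a) <=> length($b) || $a cmp $b } @_;
    my @kept;
    CANDIDATE:
    for my $t (@sorted) {
        for my $u (@kept) {
            # index() returns -1 when $u does not occur in $t
            next CANDIDATE if index($t, $u) >= 0;
        }
        push @kept, $t;
    }
    return @kept;
}

# Illustrative data: one exact duplicate and one string containing ACGT.
my @kept = filter_containing(qw(ACGT TTACGTNN GGGG ACGT NNNNN));
print "$_\n" for @kept;
```

With this sample, the second ACGT and TTACGTNN are rejected because each contains the already-kept ACGT, which leaves ACGT, GGGG and NNNNN.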

The brute force solver told me that all my datasets contained only unique strings. So I created some datasets with plenty of duplicates:

$ cat t.100 t.100 t.100 > t.300
$ cat t.1000 t.1000 t.1000 > t.3000
$ cat t.10000 t.10000 t.10000 > t.30000
$ cat t.100 t.300 > t.400
$ cat t.1000 t.3000 > t.4000
$ cat t.10000 t.30000 > t.40000

I've been monkeying with some different approaches, and my best two so far (Robo1 and Robo2) give me these times, in seconds:

num      brute
strings  force    Robo1  Robo2
-------  -------  -----  -----
    100     .125   .125   .110
    200     .234   .172   .125
    300     .202   .141   .110
    400     .234   .156   .110
    500    1.030   .187   .125
   1000    3.916   .265   .188
   2000   15.288   .390   .265
   3000   11.435   .546   .328
   4000   15.319   .656   .422
   5000   93.600   .858   .546
  10000  377.412  1.638  1.029
  20000           3.151  1.981
  30000           4.493  2.621
  40000           5.866  3.417
  50000                  4.929

I then created a few datasets with strings between 200 and 300 characters to see how my better one did:

# str     Robo2  Notes
------   ------  --------------
  1000    0.687  unique
  2000    1.264  1000 unique
 10000    6.412  unique
 20000   11.887  10000 unique
100000   65.224  unique
200000  126.190  100000 unique

I'll wait a little while before posting my solution, as I don't want to spoil things for people still working on it right now.

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Re^2: list of unique strings, also eliminating matching substrings by roboticus (Chancellor) on May 29, 2011 at 16:03 UTC
Hmmm ... I thought there would be more activity on this thread. No-one seems to be actively working on it, so here's the code I used to get my timings.

#!/usr/bin/perl
#
#       multi-string-match.pl <FName>
#
#       Grind through a set of strings, and keep only the ones that don't contain
#       any of the others as a substring.  FName is a file containing a list of
#       strings, and if null, we'll use our test data.
#
#       Inspired by perlmonks node 906020, and the Knuth-Morris-Pratt algorithm.
#
use strict;
use warnings;
use feature ':5.10';

# function is 10.67 chars wide, so need to round up, or we can't find partials
# (previous state will linger, so we can't find 'em!)
my $hashwidth = 11;

# our alphabet
my %xlat = (A=>1, C=>2, G=>3, T=>4, N=>0);

my @unique;
my @candidates;
my %MatchKeys;

my $fname = shift;
open my $FH, '<', $fname or die;
@candidates = <$FH>;

@candidates =
        grep { /^[ACGTN]+$/ } # delete the comments
        map { s/^\s+//; s/\s+$//; $_ }
        @candidates;

my $start = time;
@candidates = sort { length($a) <=> length($b) || $a cmp $b } @candidates;
my (@keypath, $t); #, @chars, @keypath);
my $cnt_dup=0;

CANDIDATE:
while ($t = shift @candidates) {
        my $h = 0;
        my $keywidth=0;
        @keypath=();
        my $rMatchKeys = \%MatchKeys;
        my $fl_partial=-1;
        my $l = length($t);
        while ($keywidth < $l) {
                $h = hash(substr($t,$keywidth,1), $h);
                ++$keywidth;
                if ($keywidth % $hashwidth == 0) {
                        push @keypath, $h;
                }
                if ($fl_partial < 0) {
                        # No current partial match
                        if (exists $MatchKeys{$h}) {
                                $rMatchKeys = $$rMatchKeys{$h};
                                $fl_partial = $keywidth;
                        }
                }
                else {
                        if ( ($keywidth - $fl_partial) % $hashwidth == 0 ) {
                                $rMatchKeys = exists($$rMatchKeys{$h}) ? $$rMatchKeys{$h} : undef;
                        }
                        elsif (exists($$rMatchKeys{REM}) and exists($$rMatchKeys{REM}{$h})) {
                                ++$cnt_dup;
                                next CANDIDATE;
                        }
                }
        }
        my $ar = [ $h, $keywidth % $hashwidth ];

        ### Add the path to %MatchKeys
        $rMatchKeys = \%MatchKeys;
        while (my $r = shift @keypath) {
                $$rMatchKeys{$r} = { } if !exists $$rMatchKeys{$r};
                $rMatchKeys = $$rMatchKeys{$r};
        }
        $$rMatchKeys{REM} = { } if !exists $$rMatchKeys{REM};
        if (exists($$rMatchKeys{REM}{$$ar[0]})
                and $$ar[1] == $$rMatchKeys{REM}{$$ar[0]}) {
                ++$cnt_dup;
                next CANDIDATE;
        }
        $$rMatchKeys{REM}{$$ar[0]} = $$ar[1];
        push @unique, $t;
}
my $end = time - $start;

print scalar(@unique), " unique items\n";
print "$cnt_dup rejected.\n";
print "$end seconds.\n";

sub hash {
        my ($curchar, $prevhash) = @_;
        $prevhash = ($prevhash * 8 + $xlat{$curchar}) & 0xffffffff;
}

...roboticus

When your only tool is a hammer, all problems look like your thumb.
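The "10.67 chars wide" remark follows from the hash update rule: each character shifts 3 bits into a value masked to 32 bits, and 32 / 3 is about 10.67. The sketch below illustrates two consequences of that arithmetic, assuming a Perl built with 64-bit integers. The helper hash_string is hypothetical and exists only for this illustration. Whether either effect is behind the missed duplicates reported later in this thread is not established here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same alphabet and update rule as the hash() sub in the script above:
# 3 bits per character shifted into a value masked to 32 bits.
my %xlat = (A=>1, C=>2, G=>3, T=>4, N=>0);

# Hypothetical helper for illustration: hash a whole string from scratch.
sub hash_string {
    my ($str) = @_;
    my $h = 0;
    $h = ($h * 8 + $xlat{$_}) & 0xffffffff for split //, $str;
    return $h;
}

# Characters 11 or more positions back are shifted by at least 33 bits,
# so the mask drops them entirely: different prefixes, same hash.
my $tail = 'ACGTNACGTNA';                       # 11 characters
printf "%08x\n", hash_string('AAAA' . $tail);
printf "%08x\n", hash_string('CCCC' . $tail);   # same value as the line above

# The character 10 positions back is shifted by 30 bits, so only 2 of its
# 3 bits survive: T (4) and N (0) cannot be told apart at that position.
my $ten = 'ACGTNACGTN';                         # 10 characters
printf "%08x\n", hash_string('T' . $ten);
printf "%08x\n", hash_string('N' . $ten);       # same value as the line above
```

The first pair shows why a window of 11 is the smallest whole number of characters that covers the full 32 bits. The second pair shows that the oldest character in that window is only partly represented.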
Re^3: list of unique strings, also eliminating matching substrings by BrowserUk (Patriarch) on May 30, 2011 at 12:12 UTC
I used this generator to create a 10000-string file where the first 5000 strings are randomly generated and the other 5000 are random substrings extracted from the first 5000. Thus, you'd expect at most 5000 unique strings, with a very slight possibility of there being fewer:

#! perl -slw
use strict;

sub rndStr{ join'', @_[ map{ rand $#_ } 1 .. shift ] }

our $N //= 10e3;
my $halfN = $N >> 1;

my @data;
$#data = $N;
$data[ $_ ] = rndStr( 200 +int( rand 200 ), 'A', 'C', 'G', 'T', 'N' ) for 0 .. $halfN;
$data[ $_ + $halfN ] = substr( $data[ $_ ], 10, 10 + int( rand( length( $data[ $_ ] ) - 20 ) ) ) for 0 .. $halfN;

print for @data;

__END__
C:\test> 906020-gen -N=10e3 > 906020.10e3

When I run your code on this file it misses some dups:

C:\test>906020-robo 906020.10e3
5551 unique items
4450 rejected.
5 seconds.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re^4: list of unique strings, also eliminating matching substrings by roboticus (Chancellor) on May 31, 2011 at 01:33 UTC
BrowserUk:

Thanks, I'll dig into it tomorrow and see if I can find out what's going wrong.

Update: I've confirmed the error (20110531) but haven't isolated it yet.

...roboticus

When your only tool is a hammer, all problems look like your thumb.
Re^3: list of unique strings, also eliminating matching substrings by yifangt (Initiate) on Sep 13, 2012 at 21:57 UTC
I gave your code a try. There are substrings among the output, but I could not figure out why. It seems there are bugs in it. Nice code though!
Re^4: list of unique strings, also eliminating matching substrings by roboticus (Chancellor) on Sep 14, 2012 at 13:49 UTC
yifangt:

Without an example of a case you're having trouble with, what do you expect me to do with your bug report? Provide a case that gives invalid results, and I can take a look at it.

...roboticus

When your only tool is a hammer, all problems look like your thumb.
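One way to produce such a case is to check a solver's output directly. This is a minimal sketch, assuming the solver has been changed to print its kept strings (the scripts in this thread print only counts). The sub name find_containment and the sample list are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical checker: return [shorter, longer] pairs where one string
# contains another (exact duplicates count as containment too).
# An empty result means the list is free of substrings.
# Quadratic, so intended for spot checks rather than big datasets.
sub find_containment {
    my @s = sort { length($a) <=> length($b) || $a cmp $b } @_;
    my @pairs;
    for my $i (0 .. $#s) {
        for my $j ($i + 1 .. $#s) {
            push @pairs, [ $s[$i], $s[$j] ] if index($s[$j], $s[$i]) >= 0;
        }
    }
    return @pairs;
}

# Illustrative stand-in for a solver's output.
my @output = qw(ACGTN GGACGTNCC TTTTT ACGTN);
printf "%s is contained in %s\n", @$_ for find_containment(@output);
```

Any pair it reports, together with the input file that produced it, would make a concrete test case to post.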

