Re: Seeking algorithm for finding common continous sub-patterns

Unless I misunderstood the question, this may be NP complete, but it doesn't take long. Leastwise not if all your numbers are small integers, < 255.

What I did was to encode the arrays of integers into strings, and then build an index to all the N-char length substrings in the 1000 x 1000-char strings. This latter process only takes a few seconds.

Less time infact than generating and encoding the test data. Reading it from a file and encoding it directly to strings would make it quicker still, by saving on the ram for all the arrays.

It takes around 27 seconds (total: Generation, encoding, indexing, counting and printing to nul!) to process 1000 x 1000-number sets on my mid-range machine.

#! perl -slw
use strict;
use Data::Dumper;

$| = 1;
$" = ', ';

our $LEN     ||= 1000;
our $COUNT    ||= 1000;
our $M         ||= 2;
our $N         ||= 2;

my @data = map {
    [ map{ int rand 256 } 1 .. $LEN ]
} 1 .. $COUNT;
warn "Gen'd data";

my @encoded = map {
    pack 'C*', @$_;
} @data;
warn 'Encoded data';

my %seq;

for my $i ( 0 .. $#encoded ) {
    for my $o ( 0 .. $LEN - $N ) {
        push @{ $seq{ substr $encoded[ $i ], $o, $N } }, $i;
    }
}

my $count = 0;
for my $run ( keys %seq ) {
    next unless @{ $seq{ $run } } > $M;
    $count++;
    printf "Run [%*s] appears in lines: %s\n",
        $N * 4,
        join(',', map{ sprintf '%03d', $_ } unpack 'C*', $run ), 
        "[@{ $seq{ $run } }]";
}

warn "Found $count $N-number runs appearing in at least $M sets"; 

__END__
[11:54:20.15] P:\test>412061 -LEN=1000 -COUNT=1000 -N=2 -M=2 >nul
Gen'd data at P:\test\412061.pl line 16.
Encoded data at P:\test\412061.pl line 21.
Found 65533 2-number runs appearing in at least 2 sets at P:\test\4120
+61.pl line 41.

[11:54:47.06] P:\test>412061 -LEN=1000 -COUNT=1000 -N=3 -M=2 >nul
Gen'd data at P:\test\412061.pl line 16.
Encoded data at P:\test\412061.pl line 21.
Found 553 3-number runs appearing in at least 2 sets at P:\test\412061
+.pl line 41.

[11:55:16.46] P:\test>
[11:55:23.76] P:\test>412061 -LEN=1000 -COUNT=1000 -N=3 -M=3 >nul
Gen'd data at P:\test\412061.pl line 16.
Encoded data at P:\test\412061.pl line 21.
Found 9 3-number runs appearing in at least 3 sets at P:\test\412061.p
+l line 41.

[11:55:55.84] P:\test>
[11:55:55.84] P:\test>412061 -LEN=1000 -COUNT=1000 -N=3 -M=3
Gen'd data at P:\test\412061.pl line 16.
Encoded data at P:\test\412061.pl line 21.
Run [ 215,017,145] appears in lines: [546, 657, 671, 986]
Run [ 205,013,147] appears in lines: [326, 360, 385, 975]
Run [ 134,174,216] appears in lines: [335, 512, 539, 592]
Run [ 114,015,201] appears in lines: [470, 544, 557, 989]
Run [ 065,233,176] appears in lines: [533, 656, 789, 930]
Run [ 003,013,171] appears in lines: [17, 169, 460, 979]
Run [ 191,241,173] appears in lines: [25, 88, 442, 782]
Run [ 245,165,229] appears in lines: [373, 405, 599, 697]
Run [ 072,156,003] appears in lines: [86, 153, 464, 544]
Run [ 055,130,150] appears in lines: [73, 127, 157, 384, 868]
Run [ 044,035,182] appears in lines: [3, 402, 734, 764]
Run [ 187,058,006] appears in lines: [171, 487, 924, 944]
Run [ 060,153,213] appears in lines: [72, 404, 708, 976]
Run [ 057,200,093] appears in lines: [221, 246, 514, 908]
Found 14 3-number runs appearing in at least 3 sets at P:\test\412061.
+pl line 41.

[12:04:02.68] P:\test>
[download]

Examine what is said, not who speaks.

"But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
"Think for yourself!" - Abigail "Time is a poor substitute for thought"--theorbtwo "Efficiency is intelligent laziness." -David Dunham
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

Comment on Re: Seeking algorithm for finding common continous sub-patterns Download Code

Replies are listed 'Best First'.
Re^2: Seeking algorithm for finding common continous sub-patterns by davidj (Priest) on Dec 03, 2004 at 13:23 UTC
As I reread his question, you may be correct. The problem is that his output does not match his stated conditions. According to his stated conditions, in the example he gives, he wants subsequences of length 2. However, his output includes a subsequence of length 3. For this reason, I am assuming that he misstated his conditions and means to say "I'd like to find common continuous subsequences of AT LEAST length n." If I am correct, then the time to solve an input of 1000's of sequences of 1000's of numbers would be prohibitively long, especially if n is small. If I am not correct, then you are closer to the actual runtime. Very good solution. davidj	[reply]
Re^3: Seeking algorithm for finding common continous sub-patterns by johnnywang (Priest) on Dec 03, 2004 at 19:04 UTC
Sorry for the confusion, I did mean "at least length n". But it's not that bad, in my particular situation, I don't expect the length to be big, say the length of the subsequence is between 2 and 10.	[reply]
Re^4: Seeking algorithm for finding common continous sub-patterns by BrowserUk (Patriarch) on Dec 03, 2004 at 20:38 UTC
As hv points out above, any 3-value run will contain one of the 2-value runs, and any 4-value run, a 3-value. Once you have built the 2-value index, you can discard any entries that do not meet the 'found in M sets' criteria and use it to fasttrack the building of the next level of index. By storing both the set number and position for each index entry on the first run, no exhaustive searching is required to build the next level of index: `xxxMMxxxxxxxMMxxxxx ## 2-value runs xx?MMxxxxxx?MMxxxxx ## Possible 3-value runs xxxMM?xxxxxxMM?xxxx ## " " "` [download] which saves a huge amount of time. You then again discard any index entries not meeting the M-sets requirement before moving on to the next level. With randomly generated values 0-255, the first level of index has (or is very close to) the theoretical maximum of 65536 entries and very few are discardable, which means that the second level (3-value) index grows quiet large. Upto 128k entries. However, at this point, the probability that the given 3-value sequence will exist in more than one of the sets drops rapidly, meaning that the M-set culling process reduces this index markedly (to a few hundreds). That makes the work required for the 4th and subsequent level indexing drop to near zero very quickly. Using randomly generated 1000x1000x0..255, I've rarely been able to find runs longer than 5-values that appear twice. A typical run is: `[19:58:08.64] P:\test>412061 -LEN=1000 -COUNT=1000 -N=10 -M=2 >log1 Generated & encoded data. 2-char substring index built: 65536 entries. 2-char index trimmed to 65535 entries. Found 65535 2-number runs appearing in at least 2 sets. 3-char index trimmed to 757 entries. Found 757 3-number runs appearing in at least 2 sets. 4-char index trimmed to 1 entries. Found 1 4-number runs appearing in at least 2 sets. 5-char index trimmed to 0 entries. [19:59:09.28] P:\test>` [download] which took almost exactly 1 minute. With a less wide range of values (smaller alphabet), the frequency would be less, but the probability higher for any given set of data, so the drop-off would be slower, but still a 1000x1000 sequence should be processable in minutes rather than hours. The modified code: #! perl -slw use strict; use bytes; use Data::Dumper; $\| = 1; $" = ', '; our $LEN \|\|= 1000; our $COUNT \|\|= 1000; our $M \|\|= 2; our $N \|\|= 3; =disabled Avoid the AoAs step to cut memory usage. my @data = map { [ map{ int rand 256 } 1 .. $LEN ] } 1 .. $COUNT; warn "Gen'd data"; =cut sub uniq { my %x; @x{ @_ } = (); keys %x } my @encoded = map { pack 'C', map{ int rand 256 } 1 .. $LEN; } 1 .. $COUNT; warn 'Generated & encoded data'; my %seq; for my $i ( 0 .. $#encoded ) { for my $o ( 0 .. $LEN - 2 ) { push @{ $seq{ 2 }{ substr $encoded[ $i ], $o, 2 } }, "$i:$o"; } } warn '2-char substring index built: ', scalar keys %{ $seq{ 2 } }, ' e +ntries.'; for my $n ( 2 .. $N ) { for my $run ( keys %{ $seq{ $n } } ) { ## Remove duplicated line numbers from index if( $n > 2 ) { ## Won't happen on the first pass, @{ $seq{ $n }{ $run } } = uniq @{ $seq{ $n }{ $run } }; } ## Trim any runs that do not appear in at least $M lines delete $seq{ $n }{ $run } unless @{ $seq{ $n }{ $run } } > $M; } warn "$n-char index trimmed to ", scalar keys %{ $seq{ $n } }, ' e +ntries.'; ## Skip out early if nothing more to do last unless keys %{ $seq{ $n } }; ## Ouput the $n-char matches. my $count = 0; for my $run ( keys %{ $seq{ $n } } ) { next unless @{ $seq{ $n }{ $run } } > $M; $count++; printf "Run [%s] appears in lines: %s\n", $n * 4, join(',', map{ sprintf '%03d', $_ } unpack 'C*', $run ), "[@{ $seq{ $n }{ $run } }]"; } warn "Found $count $n-number runs appearing in at least $M sets"; ## Go through each run remaining in index $n for my $run ( keys %{ $seq{ $n } } ) { ## And each line that run was found in for my $idx ( @{ $seq{ $n }{ $run } } ) { my( $i, $o ) = split ':', $idx; ## And build a new index ($n+1) who's keys are the previou +s runs ## +1 char infront ## +1 char behind push @{ $seq{ $n+1 }{ substr $encoded[ $i ], $o-1, $n+1 } +}, "$i:" . ( $o - 1 ) if $o > 1; push @{ $seq{ $n+1 }{ substr $encoded[ $i ], $o, $n+1 } +}, "$i:" . $o if $o < $LEN - $n +1; } } ## Save space by discarding previous index. delete $seq{ $n } } __END__ [20:29:57.57] P:\test>412061 -LEN=1000 -COUNT=1000 -N=10 > log1 Generated & encoded data at P:\test\412061.pl line 26. 2-char substring index built: 65536 entries. at P:\test\412061.pl line + 34. 2-char index trimmed to 65535 entries. at P:\test\412061.pl line 45. Found 65535 2-number runs appearing in at least 2 sets at P:\test\4120 +61.pl line 60. 3-char index trimmed to 785 entries. at P:\test\412061.pl line 45. Found 785 3-number runs appearing in at least 2 sets at P:\test\412061 +.pl line 60. 4-char index trimmed to 0 entries. at P:\test\412061.pl line 45. [20:31:03.28] P:\test> [download] Examine what is said, not who speaks. "But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen "Think for yourself!" - Abigail "Time is a poor substitute for thought"--theorbtwo "Efficiency is intelligent laziness." -David Dunham "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon	[reply] [d/l] [select]


Problems? Is your data what you think it is?
	PerlMonks