Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

Re^3: Seeking algorithm for finding common continous sub-patterns

by johnnywang (Priest)
on Dec 03, 2004 at 19:04 UTC ( [id://412237]=note: print w/replies, xml ) Need Help??


in reply to Re^2: Seeking algorithm for finding common continous sub-patterns
in thread Seeking algorithm for finding common continous sub-patterns

Sorry for the confusion, I did mean "at least length n". But it's not that bad, in my particular situation, I don't expect the length to be big, say the length of the subsequence is between 2 and 10.
  • Comment on Re^3: Seeking algorithm for finding common continous sub-patterns

Replies are listed 'Best First'.
Re^4: Seeking algorithm for finding common continous sub-patterns
by BrowserUk (Patriarch) on Dec 03, 2004 at 20:38 UTC

    As hv points out above, any 3-value run will contain one of the 2-value runs, and any 4-value run, a 3-value. Once you have built the 2-value index, you can discard any entries that do not meet the 'found in M sets' criteria and use it to fasttrack the building of the next level of index.

    By storing both the set number and position for each index entry on the first run, no exhaustive searching is required to build the next level of index:

    xxxMMxxxxxxxMMxxxxx ## 2-value runs xx?MMxxxxxx?MMxxxxx ## Possible 3-value runs xxxMM?xxxxxxMM?xxxx ## " " "

    which saves a huge amount of time.

    You then again discard any index entries not meeting the M-sets requirement before moving on to the next level.

    With randomly generated values 0-255, the first level of index has (or is very close to) the theoretical maximum of 65536 entries and very few are discardable, which means that the second level (3-value) index grows quiet large. Upto 128k entries. However, at this point, the probability that the given 3-value sequence will exist in more than one of the sets drops rapidly, meaning that the M-set culling process reduces this index markedly (to a few hundreds). That makes the work required for the 4th and subsequent level indexing drop to near zero very quickly.

    Using randomly generated 1000x1000x0..255, I've rarely been able to find runs longer than 5-values that appear twice. A typical run is:

    [19:58:08.64] P:\test>412061 -LEN=1000 -COUNT=1000 -N=10 -M=2 >log1 Generated & encoded data. 2-char substring index built: 65536 entries. 2-char index trimmed to 65535 entries. Found 65535 2-number runs appearing in at least 2 sets. 3-char index trimmed to 757 entries. Found 757 3-number runs appearing in at least 2 sets. 4-char index trimmed to 1 entries. Found 1 4-number runs appearing in at least 2 sets. 5-char index trimmed to 0 entries. [19:59:09.28] P:\test>

    which took almost exactly 1 minute.

    With a less wide range of values (smaller alphabet), the frequency would be less, but the probability higher for any given set of data, so the drop-off would be slower, but still a 1000x1000 sequence should be processable in minutes rather than hours.

    The modified code:

    #! perl -slw use strict; use bytes; use Data::Dumper; $| = 1; $" = ', '; our $LEN ||= 1000; our $COUNT ||= 1000; our $M ||= 2; our $N ||= 3; =disabled Avoid the AoAs step to cut memory usage. my @data = map { [ map{ int rand 256 } 1 .. $LEN ] } 1 .. $COUNT; warn "Gen'd data"; =cut sub uniq { my %x; @x{ @_ } = (); keys %x } my @encoded = map { pack 'C*', map{ int rand 256 } 1 .. $LEN; } 1 .. $COUNT; warn 'Generated & encoded data'; my %seq; for my $i ( 0 .. $#encoded ) { for my $o ( 0 .. $LEN - 2 ) { push @{ $seq{ 2 }{ substr $encoded[ $i ], $o, 2 } }, "$i:$o"; } } warn '2-char substring index built: ', scalar keys %{ $seq{ 2 } }, ' e +ntries.'; for my $n ( 2 .. $N ) { for my $run ( keys %{ $seq{ $n } } ) { ## Remove duplicated line numbers from index if( $n > 2 ) { ## Won't happen on the first pass, @{ $seq{ $n }{ $run } } = uniq @{ $seq{ $n }{ $run } }; } ## Trim any runs that do not appear in at least $M lines delete $seq{ $n }{ $run } unless @{ $seq{ $n }{ $run } } > $M; } warn "$n-char index trimmed to ", scalar keys %{ $seq{ $n } }, ' e +ntries.'; ## Skip out early if nothing more to do last unless keys %{ $seq{ $n } }; ## Ouput the $n-char matches. my $count = 0; for my $run ( keys %{ $seq{ $n } } ) { next unless @{ $seq{ $n }{ $run } } > $M; $count++; printf "Run [%*s] appears in lines: %s\n", $n * 4, join(',', map{ sprintf '%03d', $_ } unpack 'C*', $run ), "[@{ $seq{ $n }{ $run } }]"; } warn "Found $count $n-number runs appearing in at least $M sets"; ## Go through each run remaining in index $n for my $run ( keys %{ $seq{ $n } } ) { ## And each line that run was found in for my $idx ( @{ $seq{ $n }{ $run } } ) { my( $i, $o ) = split ':', $idx; ## And build a new index ($n+1) who's keys are the previou +s runs ## +1 char infront ## +1 char behind push @{ $seq{ $n+1 }{ substr $encoded[ $i ], $o-1, $n+1 } +}, "$i:" . ( $o - 1 ) if $o > 1; push @{ $seq{ $n+1 }{ substr $encoded[ $i ], $o, $n+1 } +}, "$i:" . $o if $o < $LEN - $n +1; } } ## Save space by discarding previous index. delete $seq{ $n } } __END__ [20:29:57.57] P:\test>412061 -LEN=1000 -COUNT=1000 -N=10 > log1 Generated & encoded data at P:\test\412061.pl line 26. 2-char substring index built: 65536 entries. at P:\test\412061.pl line + 34. 2-char index trimmed to 65535 entries. at P:\test\412061.pl line 45. Found 65535 2-number runs appearing in at least 2 sets at P:\test\4120 +61.pl line 60. 3-char index trimmed to 785 entries. at P:\test\412061.pl line 45. Found 785 3-number runs appearing in at least 2 sets at P:\test\412061 +.pl line 60. 4-char index trimmed to 0 entries. at P:\test\412061.pl line 45. [20:31:03.28] P:\test>

    Examine what is said, not who speaks.
    "But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
    "Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo         "Efficiency is intelligent laziness." -David Dunham
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://412237]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others sharing their wisdom with the Monastery: (3)
As of 2024-04-19 22:05 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found