Re^4: Powerset short-circuit optimization

in reply to Re^3: Powerset short-circuit optimization
in thread Powerset short-circuit optimization

jimt,
I too couldn't leave good enough alone. After realizing an embarrasing oversight (pointed out by ambrus), I decided to try my hand at a perl + Inline::C version. It finishes the 3_477 records in 1 second flat with constant memory consumption (less than 65MB). Furthermore, here is how it stacks up against the 67_108_863 lines of output from 9_448_847 lines of input (see How many words does it take? for more details):

Pure perl ver 1 = 343 min
Pure perl ver 2 = 288 min
Pure perl ver 3 = 251 min
Java = 66 min
Perl + Inline::C = 3 min

#!/usr/bin/perl
use strict;
use warnings;
use Inline C =>;

for my $file (@ARGV) {
    open(my $fh, '<', $file) or die "Unable to open '$file' for readin
+g: $!";
    while (<$fh>) {
        my ($set) = $_ =~ /^(\w+)/;
        powerset($set, length($set));
    }
}
__END__
__C__
#include <stdio.h>
#include <stdlib.h>

void powerset (char *set, int len) {
    static char seen[67108864];
    static int init_flag = 0;
    static long val[128];
    int i, j, k, new_len;
    long bit = 0;
    char *oldset, *newset;
    if (! init_flag) {
        memset(seen, '0', sizeof(seen));
        val['a'] = 1;        val['b'] = 2;       val['c'] = 4;       v
+al['d'] = 8;
        val['e'] = 16;       val['f'] = 32;      val['g'] = 64;      v
+al['h'] = 128;
        val['i'] = 256;      val['j'] = 512;     val['k'] = 1024;    v
+al['l'] = 2048;
        val['m'] = 4096;     val['n'] = 8192;    val['o'] = 16384;   v
+al['p'] = 32768;
        val['q'] = 65536;    val['r'] = 131072;  val['s'] = 262144;  v
+al['t'] = 524288;
        val['u'] = 1048576;  val['v'] = 2097152; val['w'] = 4194304; v
+al['x'] = 8388608;
        val['y'] = 16777216; val['z'] = 33554432;
        init_flag = 1;
        oldset = malloc((len + 1) * sizeof(char));
        for (i = 0; i < len; ++i) { oldset[i] = set[i]; }
        oldset[len] = '\0';
    }
    else { oldset = set; }
    for (i = 0; i < len; ++i) { bit += val[ oldset[i] ]; }
    if (seen[bit] == '1') { return; }
    seen[bit] = '1';
    printf("%s\n", oldset);
    if (len == 1) { return; }
    new_len = len - 1;
    newset = malloc((len + 1) * sizeof(char));
    for (i = 0; i < len; ++i) {
        k = 0;
        for (j = 0; j < i; ++j)       { newset[k++] = oldset[j]; }
        for (j = i + 1; j < len; ++j) { newset[k++] = oldset[j]; }
        newset[k] = '\0';
        powerset(newset, new_len);
    }
    free(newset);
}
[download]

Update: I had solicited feedback on my poor C in the CB and was told that, despite it being a translation of the Perl above, that documentation would go along way in getting the desired comments.

For each $set in a file, the code generates the powerset (minus the empty set). It is designed to skip any subset that has previously been produced. The code to generate the powerset is intentionally inefficient from the perspective of a single $set, but more than makes up for it when the "skip previously seen subset" is applied across all sets in the file.

The algorithm to generate the powerset (ABCD) is as follows:

Generate all subsets of size N-1 (ABC, ABC, ACD, BCD)
For each new subset, repeat process until size is 1
Stop recursing as soon as subset has previously been seen

Because over 67 million (2 ** 26) sets will be generated, keeping track of what has previously been seen can take up a lot of space. There is a pretty straight forward way to translate a set into an integer that fits into a contigous range of 1 .. 2 ** 26. For those interested, the algorithm is the sum of 1 << ord(x) - ord('a') where x is all lowercase chars in the subset. I have created a static lookup table val[] rather than generate the values each time. In perl, I use a bitstring with vec but in C have chosen to use a char array seen[] which takes up (2 ** 26) / 1024 * 1024 or 64MB.

Since some of these arrays need to retain their values between invocations they have been made static. It is basically then just a matter of determining if the passed in string has been seen and if not, allocate a new string of size N - 1. Generate each string of size N - 1 and call self recursively.

Cheers - L~R

In Section Seekers of Perl Wisdom