http://qs321.pair.com?node_id=581697


in reply to Re^3: Powerset short-circuit optimization
in thread Powerset short-circuit optimization

jimt,
I too couldn't leave good enough alone. After realizing an embarrasing oversight (pointed out by ambrus), I decided to try my hand at a perl + Inline::C version. It finishes the 3_477 records in 1 second flat with constant memory consumption (less than 65MB). Furthermore, here is how it stacks up against the 67_108_863 lines of output from 9_448_847 lines of input (see How many words does it take? for more details):
#!/usr/bin/perl use strict; use warnings; use Inline C =>; for my $file (@ARGV) { open(my $fh, '<', $file) or die "Unable to open '$file' for readin +g: $!"; while (<$fh>) { my ($set) = $_ =~ /^(\w+)/; powerset($set, length($set)); } } __END__ __C__ #include <stdio.h> #include <stdlib.h> void powerset (char *set, int len) { static char seen[67108864]; static int init_flag = 0; static long val[128]; int i, j, k, new_len; long bit = 0; char *oldset, *newset; if (! init_flag) { memset(seen, '0', sizeof(seen)); val['a'] = 1; val['b'] = 2; val['c'] = 4; v +al['d'] = 8; val['e'] = 16; val['f'] = 32; val['g'] = 64; v +al['h'] = 128; val['i'] = 256; val['j'] = 512; val['k'] = 1024; v +al['l'] = 2048; val['m'] = 4096; val['n'] = 8192; val['o'] = 16384; v +al['p'] = 32768; val['q'] = 65536; val['r'] = 131072; val['s'] = 262144; v +al['t'] = 524288; val['u'] = 1048576; val['v'] = 2097152; val['w'] = 4194304; v +al['x'] = 8388608; val['y'] = 16777216; val['z'] = 33554432; init_flag = 1; oldset = malloc((len + 1) * sizeof(char)); for (i = 0; i < len; ++i) { oldset[i] = set[i]; } oldset[len] = '\0'; } else { oldset = set; } for (i = 0; i < len; ++i) { bit += val[ oldset[i] ]; } if (seen[bit] == '1') { return; } seen[bit] = '1'; printf("%s\n", oldset); if (len == 1) { return; } new_len = len - 1; newset = malloc((len + 1) * sizeof(char)); for (i = 0; i < len; ++i) { k = 0; for (j = 0; j < i; ++j) { newset[k++] = oldset[j]; } for (j = i + 1; j < len; ++j) { newset[k++] = oldset[j]; } newset[k] = '\0'; powerset(newset, new_len); } free(newset); }

Update: I had solicited feedback on my poor C in the CB and was told that, despite it being a translation of the Perl above, that documentation would go along way in getting the desired comments.

For each $set in a file, the code generates the powerset (minus the empty set). It is designed to skip any subset that has previously been produced. The code to generate the powerset is intentionally inefficient from the perspective of a single $set, but more than makes up for it when the "skip previously seen subset" is applied across all sets in the file.

The algorithm to generate the powerset (ABCD) is as follows:

  • Generate all subsets of size N-1 (ABC, ABC, ACD, BCD)
  • For each new subset, repeat process until size is 1
  • Stop recursing as soon as subset has previously been seen

Because over 67 million (2 ** 26) sets will be generated, keeping track of what has previously been seen can take up a lot of space. There is a pretty straight forward way to translate a set into an integer that fits into a contigous range of 1 .. 2 ** 26. For those interested, the algorithm is the sum of 1 << ord(x) - ord('a') where x is all lowercase chars in the subset. I have created a static lookup table val[] rather than generate the values each time. In perl, I use a bitstring with vec but in C have chosen to use a char array seen[] which takes up (2 ** 26) / 1024 * 1024 or 64MB.

Since some of these arrays need to retain their values between invocations they have been made static. It is basically then just a matter of determining if the passed in string has been seen and if not, allocate a new string of size N - 1. Generate each string of size N - 1 and call self recursively.

Cheers - L~R

Replies are listed 'Best First'.
Re^5: Powerset short-circuit optimization
by creamygoodness (Curate) on Nov 06, 2006 at 02:45 UTC

    Hi, L~R...

    First the good news: the algorithm seems fine to me, and there's only one language-specific gotcha I found that could have caused you problems. That's very impressive for someone with next-to-no C experience.

    Now the critique: This is hard code to review. The explanation helps, but there are a lot of things that get in the way.

    • Code without any comments at all is lame. Lame, lame, lame. Especially when its A) dense, and B) implements an unusual algorithm.
    • Those initializations wrap in every browser I checked. Do they wrap in yours? Looks like $h$_ amigo! Do you care?
    • While single line for/while/if blocks are a common Perl idiom, they are rare in C code.
    • If the variable has nothing to do with a "bit", don't call it bit. Yes, I know now that's an artifact of having translated Perl code that used a bit vector. But coming in fresh, I didn't, and it threw me off.
    • 4 out of 5 coders surveyed recommend naming variables with meaningful monikers rather than nebulosities like "val".
    • There are no blank lines to break up that big C function into chunks. It would be much easier to digest if it were in paragraphs -- ideally, commented paragraphs, as Damian recommends in Perl Best Practices.

    With all those impediments, the only way this ordinary human was able to get through that and follow it from start to finish was to rewrite it myself. In so doing, I'm happy to report the only thing I found that was really off was this:

    memset(seen, '0', sizeof(seen)); /* ... snip ... */ if (seen[bit] == '1') { return; } seen[bit] = '1';

    In C, '0' != 0 and '1' != 1. Those single quotes return the integer value of the enclosed character, which, for '0' on an ascii machine is not 0 but 48. :) You got away with it because you were consistent, though. :) That memset() should look like this:

    memset(seen, 0, sizeof(seen));
    --
    Marvin Humphrey
    Rectangular Research ― http://www.rectangular.com