Removing redundant powersets with minimal RAM

jimt has asked for the wisdom of the Perl Monks concerning the following question:

So I've been playing around a lot with Powerset short-circuit optimization and trying to come up with a memory efficient solution. To succinctly state the original problem, given a set of sets, generate the powerset for all sets but remove all duplicates.

Examples help.


Given the sets:
ABCD
ABCE
ABD
[download]

Generate the powerset of them (the set of all possible subsets), but eliminate duplicates. So "ABC" is a subset of "ABCD" and "ABCE", but we only want to print it out once. "A" is a subset of "ABCD", "ABCE", and "ABD", but we only want to print it out once, etc.

I'd posted a solution over in that node, but far from the best one. The going best solution stores the powerset in memory, using 64 megs if each set is stored in a byte, or only 8 megabytes if each set is stored as a bit. You march along your list and keep a note of each set you generate. It's fast and a good solution.

Now, I'm not knocking it at all. It's a great solution. But I'm looking at this in the context of a bigger problem, which rapidly scales beyond available ram. Sure, with a 26 element set, you only need 64 megs (I'll make the argument with the byte based approach instead of the bits, but it's the same thing, of course). But 27 elements requires 128 megs. 28 needs 256. 29 needs 512. 32 needs a gig. And so on, a 36 element set would require 16 gigs of memory, which is probably not realistic, though, yes, you can store it and do your lookups on disk. A 36 element set is not unreasonable, since that's a-z + 0-9. We're going to gloss over the fact that a 36 element set would require more than 32 bits to store, so most machines couldn't stuff it into a single integer.

Anyway, so I've been trying to optimize it for memory consumption. Internally, all of these sets are represented as integers, but for clarity's sake in the argument, I'll just type them out as strings. Much easier to read and understand "ABCD" and "AB" than "15" and "9", for instance.

Let's start off with this input file:


ABCD
ABE
ABC
EF

Using the "store everything in the powerset approach" (or, at least al
+locating space for it), we need to store 2^6 elements (A,B,C,D,E,F) o
+r 64 bytes of data. We're going to assume that we know in advance tha
+t there are only 6 possible elements.

For sake of argument, let's assume a more compact efficient approach a
+long the lines of the original approaches using hash tables. We'd bui
+ld up a structure like this, in some arbitrary order:

#first set
ABCD
BCD
ACD
ABD
ABC
AB
AC
BC
AD
BD
CD
A
B
C
D
#second set
ABE
AE
BE
E
#third set
#generates nothing
#fourth set
EF
F

This only requires 22 bytes of data, a 65% memory savings.
[download]

My eureka moment was that instead of storing the powerset for each set encountered (pow(ABCD), pow(ABE), pow(ABC), etc.), we could just store the original set, and then check all subsequent sets to see if they're subsets of it.


#first set generates normally
ABCD    #ABCD is the first set, store internally
BCD
ACD
ABD
ABC
AB
AC
BC
AD
BD
CD
A
B
C
D
#second set
is ABE a subset of ABCD? 
    NO -> print    #ABE is the first set, store internally
is AB a subset of ABCD?
    YES -> skip
is AE a subset of ABCD?
    NO -> print
is BE a subset of ABCD?
    NO -> print
is A a subset of ABCD?
    YES -> skip
is B a subset of ABCD?
    YES -> skip
is E a subset of ABCD?
    YES -> skip
#third set
is ABC a subset of ABCD?
    YES -> skip    
#generates nothing, stores nothing
#fourth set
is EF a subset of ABCD?
    NO -> is EF a subset of ABE?
        NO -> print
is E a subset of ABCD?
    NO -> is EF a subset of ABE?
        YES -> skip
is F a subset of ABCD?
    NO -> is F a subset of ABE?
        NO -> print

internally, we only need to store ABCD, ABE, and EF for 3 bytes of dat
+a stored - a 95.3% storage savings.
[download]

Now, the problem is, of course, this takes a hell of a lot more work. instead of doing a single check at each set ("Have I seen this set before?") it needs to check and see if the set we're encountering is a subset of any previously seen set. 14 checks in this case. Ouch.

Now, remember, these sets are all stored as integers, so the checks are very very fast, its just a bitwise & and an equality, but doing 14 things quickly still loses out to doing 1 thing quickly. It also doesn't scale. As you encounter more sets, you do more and more and more checks until your CPU catches fire.

I've optimized this in two ways - one is to store the list of sets in sorted numerical order (ints internally, again). So the list would be: (EF, ABE, ABCD) (assuming higher letters are bigger bits). When doing the checks, it stops looking once it encounters a set smaller than itself.


So if we added on ABF to the list, it would check as follows:

is ABF a subset of EF?
    NO -> ABE < ABF, therefore ABF cannot be a subset of anything. Sto
+p now and print ABE
[download]

And it bows out much faster, only needing to do 2 checks instead of 3. A B tree or any fancier structure is of dubious utility at this point, since a set could potentially be a subset of any previously encountered one. Order doesn't matter, just "bigger than me" (check it) and "less than me" (we're done)

The final optimization I added was to redundantly store the previously seen sets in separate arrays for each element.


So instead of:

my @previous = qw(EF ABE ABCD)

I'd have:

my @previous = (
    [qw(ABE ABCD)],    #A slot
    [qw(ABE ABCD)],    #B slot
    [qw(ABC)],        #C slot
    [qw(ABCD)],        #D slot
    [qw(EF ABE)],    #E slot
    [qw(EF)]        #F slot
);
[download]

If a set contains a particular element, it goes into that slot. Then, when we encounter set ABF on our next pass, it would need to compare against all elements in the "A" slot, the "B" slot, and the "F" slot.


    [qw(ABE ABCD)],    #A slot
    [qw(ABE ABCD)],    #B slot
    [qw(EF)]        #F slot
[download]

If we can figure out if there's an element that exists in all three of those sets, then it solves the equation, we know that there exists some set that's a super set, and we're done. For example, if we next encountered "AB", we'd look at the "A" slot and the "B" slot, and we could see that "ABE" (and "ABCD") exists in both slots, therefore, there is a superset, therefore we can stop. This'd be awesome, but unfortunately the only solutions I've come up with effectively require storing the powerset in memory, which negate the original purpose. If I'm going to store the powerset, I'd use one of the earlier proposed approaches that would be faster.

So I don't have an efficient way to do that. To state clearly, this version of the problem is: I have several sets of sets. Rapidly tell me if an arbitrary set is an element of each of those sets. This is effectively the Longest Common Subsequence problem, which is NP Complete.

Instead, I look at the smallest set in the list, and do my iterative checks over that. For "ABF", I only need to compare against the "F" slot, and a single element ("EF). I do this since I could compare against the elements in the "A" slot, the "B" slot, and the "F" slot, but there are the fewest elements in the "F" slot, so I look there. Since "ABF" is not a subset of "EF", we're done. No further checking required.

This works since the only way a solution would exist is if its in all 3 arrays anyway. So we can just check the smallest one to do as little work as possible. We still bow out once we encounter a set smaller than we are, since they're still in sorted order internally.

And here's where I'm stuck. If I could rapidly determine if there exists an element that is in all of the subarrays chosen (in the A, B, and F slots, in this case), I'd be done. But the only approaches I've come up with to do this involve linearly scanning all the lists to build up a new data structure. But this is actually worse, since doing the check at each level only requires a linear scan of the smallest set. A scan through the "F" slot to determine if any element is a superset is faster than counting all the elements in the "A", "B", and "F" slots. This becomes especially clear if the sets are greatly imbalances. Say that there are 4,500 elements in the "A" slot, but only 1 in the "F" slot.

If there exists a reasonably fast way to do this (or, at least one that doesn't get worse as more sets are added), the gains would be tremendous. I'd be thrilled to death with something that's slower for smaller sets but scales well to larger sets. If I take .01 seconds per check, but always take .01 seconds per check, I win. Using the original "store the powerset" approach would require 16 gigs of memory for a 36 element set. Using this approach, you only are required to store (roughly):


(total number of sets) * (average length of sets) bytes of data.

Let's say we have 10,000,000 sets with an average set length of 24. Th
+at's only going to require 240,000,000 bytes to store (240 megabytes)
+, which is slightly less than 1% of the total ram of the "store it al
+l" approach.
[download]


To demonstrate the optimized version:

#first set generates normally
ABCD    #ABCD is the first set, store internally
BCD
ACD
ABD
ABC
AB
AC
BC
AD
BD
CD
A
B
C
D
#second set ABE -> look at slots A, B, E. All are the same, so arbitra
+rily check against "A" (ABCD)
is ABE a subset of ABCD? 
    NO -> print    #ABE is the first set, store internally
is AB a subset of ABCD?
    YES -> skip
is AE a subset of ABCD?
    NO -> print
is BE a subset of ABCD?
    NO -> print
is A a subset of ABCD?
    YES -> skip
is B a subset of ABCD?
    YES -> skip
is E a subset of ABCD?
    YES -> skip
#third set
is ABC a subset of ABCD?
    YES -> skip    
#generates nothing, stores nothing
#fourth set -> look at slots "E", "F". "F" is the smallest (empty), so
+ we need to do -no- checks.
EF
E
F

We store 9 bytes and do 10 checks (determining "no checks" counts as a
+ check). Note - the original
powerset version would also require 10 checks, so we're optimal here.
[download]

I just need a clever way to do the search to reduce the number of checks for the cases where they don't match or there are large sets involved.

Comment on Removing redundant powersets with minimal RAM Select or Download Code

Replies are listed 'Best First'.
Re: Removing redundant powersets with minimal RAM by samtregar (Abbot) on Nov 03, 2006 at 17:50 UTC
Not to ruin all your fun, but have you considered just writing out the dups and piping your output through 'sort -u'? The 'sort' implementation on my system is very memory efficient and reasonably fast. -sam	[reply]
Re: Removing redundant powersets with minimal RAM by Anonymous Monk on Nov 03, 2006 at 19:00 UTC
I just need a clever way to do the search to reduce the number of checks for the cases where they don't match or there are large sets involved. There can't be large sets involved. A tiny little set with only 100 elements has a power set of size 2^100. It would take 952,589,676,412,928 GB just to store a bit vector of that many elements.	[reply]
Re^2: Removing redundant powersets with minimal RAM by jimt (Chaplain) on Nov 03, 2006 at 19:16 UTC
Re-read my post - the whole point is not to store the powerset anywhere. A set with only 100 elements in my scheme needs to be stored only once. That's 13 bytes in a bit vector. The powerset is implied by checking against subsets of it using bitwise & operations. The problem is when there are multiple sets, there are multiple checks. 500 sets of 100 elements each only require 6.5k to store in memory, but require up to 500 checks to determine if the set I'm currently working with is a subset of any of them. It's that searching through those sets I want to speed up.	[reply]
Re^3: Removing redundant powersets with minimal RAM by Anonymous Monk on Nov 03, 2006 at 22:12 UTC
To succinctly state the original problem, given a set of sets, generate the powerset for all sets but remove all duplicates. I guess I'm confused by your stated purpose. If you actually generate the power set of just one set with 100 elements (let alone try to consider how it iteracts with any other set!), you'll output a power set consisting of 1,267,650,600,228,229,401,496,703,205,376 elements. Where are you going to put this power set once you generate it? It's not like you can store it on your hard drive. It won't fit on any storage device I can concieve of. You certainly can't read it off the screen. I'm confused. :-(	[reply]


No such thing as a small change
	PerlMonks