Re^2: algorithm for 'best subsets'

A divide-and-conquer approach may help you. There is a very fast algorithm (nearly order N) called UnionFind which will divide the problem into groups of items that have no keys in common. Then each partition could be treated separately.

The module Graph::UnionFind implements it. Here is an example. ~~It finds 153 partitions of the 17,576 node problem in a minute and 26 seconds on my machine.~~ It finds 129 partitions in a minute and 14 seconds.

Update: Fixed a bug in the code, and added code to separate out the partitions and count the length of each.


#!/usr/bin/perl
use strict;
use warnings;
use Bit::Vector::Overload;
use Graph::UnionFind;

# Insert real data here.
my %items = (
    a => [ qw/one six/ ],
    b => [ qw/two three five/ ],
    c => [ qw/one two five/ ],
);
my $icount = keys %items;
print "Doing $icount items\n";

# Assign an index position to each item.
# Form a mapping from items to bit positions.
# Collect a list of bitmaps for combination work.
my @revipos;
my %ipos;
my @ilst;
my $ix = 0;
for my $itm ( sort keys %items ) {
   $ipos{$itm} = $ix;
   push @revipos,$itm;
   my $set1 = new Bit::Vector($icount + 1);
   $set1->Bit_On($ix);
   push @ilst, $set1;
   ++$ix;
}

# Form a mapping from keywords to bit positions.
my $scount = 0;
my @revkpos;
my %kpos;
for my $itm ( @revipos ) {
   for my $keyw ( @{ $items{ $itm }} ) {
      if (!exists $kpos{$keyw}) {
         $kpos{$keyw} = $scount++;
         push @revkpos, $keyw;
      }
   }
}
# Also form a reverse index for later printing.
#my %revkpos = reverse(%kpos);

my $uf = Graph::UnionFind->new;
# Supplemental data to look up by union number.
my %union_data;

# Form bit vectors with ones in the keyword positions,
# one for every item.
my %keyword_vecs;
my @lst1;
my $i = 0;
for my $itm ( @revipos ) {
   #print "doing item $i\n";
   my $set0 = new Bit::Vector($scount + 1);
   $keyword_vecs{$itm} = $set0;
   for my $keyw ( @{ $items{ $itm }} ) {
      $set0->Bit_On($kpos{$keyw});
   }

   # hold both sets - items and keywords together.
   #push @lst1,[$ilst[$ipos{$itm}], $set0];

   # unionfind to detect partitions.
   $uf->add($i);

   my $unioned = 0;
   my @ukeys = keys %union_data;
   for my $uk (@ukeys) {
      my $partu = $uf->find($uk);
      if ($uk != $partu) {
         delete $union_data{$uk};
      }
      if ($set0 & $union_data{$partu}) {
         $unioned = 1;
         $uf->union( $i, $partu);
         my $parti = $uf->find( $i );
         my $newset = $set0 | $union_data{$partu};
         delete $union_data{$partu} if ($partu != $parti);
         $union_data{$parti} = $newset;
      }
   }
   $union_data{$i} = $set0 if (!$unioned);
   
   ++$i;
}
print "Expecting ",scalar(keys %union_data)," partitions\n";

my @parts;
my %unique_parts;
my $pcount = 0;
for $i (0 .. $#revipos) {
   my $parti =  $uf->find( $i );
   if (!exists $unique_parts{$parti}) {
      $unique_parts{$parti} = $pcount;
      push @{ $parts[$pcount] }, $i;
      $pcount++;
   } else {
      push @{ $parts[ $unique_parts{$parti} ] }, $i;
   }
}
print "Found $pcount partitions\n";
for $i (0 .. $#parts) {
   print scalar( @{ $parts[$i] })," items in partition $i\n";
}
[download]

Comment on Re^2: algorithm for 'best subsets' Download Code

Replies are listed 'Best First'.
Re^3: algorithm for 'best subsets' by BrowserUk (Patriarch) on Mar 04, 2005 at 19:59 UTC
Could you explain the format of `%union_data`? What do the numeric keys represent> Item number? What do the bitstring values represent? Examine what is said, not who speaks. Silence betokens consent. Love the truth but pardon error.	[reply] [d/l] [select]
Re^4: algorithm for 'best subsets' by tall_man (Parson) on Mar 04, 2005 at 20:30 UTC
There were bugs in my earlier code, but now the intended meaning of `%union_data` is actually fulfilled. Given two numeric keys for vertices in a graph, the UnionFind "union" operation creates an edge joining them and returns one of the two keys as a master key to the joined vertex set. What I am doing with `%union_data` is supplementing the algorithm -- under each master key, there is a bit-vector with the union of all bits that are true in any member of the joined set. In my implementation, the vertices are item numbers, so the key to `%union_data` is a subset of the item numbers. The binary values are bit vectors: bit `[k]` is true if keyword`[k]` is in the set. This allows me to extend the partition easily by testing each new item against the joined sets that have already been built.	[reply] [d/l] [select]
Re^5: algorithm for 'best subsets' by BrowserUk (Patriarch) on Mar 04, 2005 at 21:11 UTC
Update: Okay. I hadn't seen your updated code when I wrote this. One question though. Couldn't you just accumulate the item numbers involved in each partition as you go, rather than rediscovering them afterwards? Then I still don't understand what the resultant dataset in `%union_data` is? The keys of the hash are a subset of the items. But how does one item number represent a partition of the total items? The values are a bitmap representing a set of keywords. I think I understand that the value bitmaps represent an inclusive OR (union) of all the keywords found in a given partition, that have no intersection with any of the keywords in any other of the partitions? Is that correct? But there is no quick way(?) to determine which items are in each partition As each 'item number' key represents it's partition, not identifies it. And each value (keyword bitmap) is also composite, so there is no way back to the individual item/keywords sets(?) in order to do the n-ary unions, there either. Don't get me wrong, this is a quicker approach than the one I was persuing--reducing the n-ary unions by excluding those that didn't share at least n keywords in a pairwise union--but I'm struggling to see how you go forward from where your code leaves off? Examine what is said, not who speaks. Silence betokens consent. Love the truth but pardon error.	[reply] [d/l]
Re^6: algorithm for 'best subsets' by tall_man (Parson) on Mar 04, 2005 at 22:05 UTC
Re^7: algorithm for 'best subsets' by BrowserUk (Patriarch) on Mar 04, 2005 at 23:27 UTC
Some notes below your chosen depth have not been shown here
Re^3: algorithm for 'best subsets' by halley (Prior) on Mar 05, 2005 at 04:16 UTC
I've spent a good couple hours this evening trying to figure out the intent of this code, and actually using my own data. I'm not there yet. I understand Bit::Vector, and making vectors that are the width of the keyword table, and marking the keyword bits for a given item. You don't seem to need the first gang of bit vectors at all; you are marking item bits in item-wide vectors, then never using those vectors. I understand what I read in Graph::UnionFind, but not its algorithm internally, but I might not need to. I think I understand the output it should give. What I don't understand is the way you're trying to combine these methods, and second-guessing the Graph::UnionFind's results on each loop. It seems to me that I can use G::UF without bit vectors at all, adding edges between correlated keywords, and then scan each partition for what keywords are in each partition. Can you speak more to your reasoning? -- `[ e d @ h a l l e y . c c ]`	[reply]
Re^4: algorithm for 'best subsets' by tall_man (Parson) on Mar 05, 2005 at 15:12 UTC
The first set of bit vectors was not used in the code I posted, but they are used in another version of the code I was working on. If I operate on them in the same way I do the second set of bit vectors, I should get a complete set of the items that were combined into each partition. It's useful for testing and for using the partitions later. I suppose you could use Graph::UnionFind by itself, but it would be horrendously inefficient. You would have to compare each pair of items to see if they intersected -- O(N^2). With my method (once I get it working right) I only have to test against the combined bit-set of each partition -- O(N*P) where P is the average number of partitions. If you really want to scale to tens of thousands of items, I think this is the right approach. However, trying to work from the outside of Graph::UnionFind was probably a mistake. I need to create a modified version that stores the bit-vectors as part of the internal structure. Then, as it re-organizes itself, it would be able to keep the information accurately. UnionFind uses a forest of n-ary trees which have only parent pointers. You can easily go from a child to the root, but not vice-versa. It can also "flatten" the tree by making more of the pointers go directly to the top. I believe the flattening is what is causing me to lose information. There are many references to the "union find" algorithm, some with animated Java applets. Here is one.	[reply]
Re^5: algorithm for 'best subsets' by halley (Prior) on Mar 05, 2005 at 15:57 UTC
Pseudocode for my version of the algorithm (without using G::UF or any graph concept at all), if I understand it correctly. By "kbits" I mean a keyword bit vector. By "parts" I mean partitions. Am I missing something? `# bag of parts # for each item # for each existing part # if this item's kbits intersect this part's kbits, # union up the keywords # for all other parts # if this part's kbits intersect that part's kbits, # merge that part into this part # prune parts emptied by merger # create a new part if no intersections found` [download] It seems to work, and scans my whole current database of 5810 keywords in 6628 items in about three seconds. Unfortunately, it grows to about 5 partitions maximum, and by the time it's done, it has merged back everything into one partition. I think that's the fault of my keywords pruning, though. Even though I filter out the 100 most boring prepositions and articles, I need to find out the remaining words that cause the most mergers... Update: How depressing. Not only is `'war'` the most common keyword in modern history, but it appears to be the common thread amongst all of the events as well; removing that one keyword broke the historical context into five separate partitions. -- `[ e d @ h a l l e y . c c ]`	[reply] [d/l] [select]
Re^6: algorithm for 'best subsets' by tall_man (Parson) on Mar 06, 2005 at 13:43 UTC