Re^4: algorithm for 'best subsets'

Re^5: algorithm for 'best subsets' by BrowserUk (Patriarch) on Mar 04, 2005 at 21:11 UTC
Update: Okay. I hadn't seen your updated code when I wrote this. One question though. Couldn't you just accumulate the item numbers involved in each partition as you go, rather than rediscovering them afterwards?

I still don't understand what the resultant dataset in `%union_data` is. The keys of the hash are a subset of the items. But how does one item number represent a partition of the total items? The values are bitmaps, each representing a set of keywords. I think I understand that each value bitmap is an inclusive OR (union) of all the keywords found in a given partition, and that it has no intersection with the keywords of any other partition. Is that correct?

But there is no quick way(?) to determine which items are in each partition, as each 'item number' key represents its partition rather than identifying it. And each value (keyword bitmap) is also composite, so there is no way back to the individual item/keyword sets(?) in order to do the n-ary unions, either.

Don't get me wrong, this is a quicker approach than the one I was pursuing--reducing the n-ary unions by excluding those that didn't share at least n keywords in a pairwise union--but I'm struggling to see how you go forward from where your code leaves off.

Examine what is said, not who speaks. Silence betokens consent. Love the truth but pardon error.
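To make the point about composite bitmaps concrete, here is a minimal sketch, not tall_man's code. The keyword numbering in `%kw_index` and the helper names `bitmap` and `intersects` are made up for illustration. It builds per-item keyword bitmaps with `vec`, ORs two of them into a partition-level union, and tests for overlap with a string AND. Once the union is taken, nothing in it says which item contributed which bit.

```perl
use strict;
use warnings;

# Hypothetical keyword numbering, for illustration only.
my %kw_index = (apple => 0, pear => 1, plum => 2, fig => 3);

# Build a bit-string with one bit set per keyword.
sub bitmap {
    my $bits = '';
    vec($bits, $kw_index{$_}, 1) = 1 for @_;
    return $bits;
}

my $item_a = bitmap(qw(apple pear));
my $item_b = bitmap(qw(pear plum));
my $item_c = bitmap(qw(fig));

# Union of a and b: the composite keyword set for the partition they share.
my $union_ab = $item_a | $item_b;

# Return 1 if the two bitmaps share at least one set bit, else 0.
sub intersects { return ( ( $_[0] & $_[1] ) =~ /[^\0]/ ) ? 1 : 0 }

printf "a and b share a keyword: %d\n",  intersects($item_a,   $item_b);
printf "ab and c share a keyword: %d\n", intersects($union_ab, $item_c);
printf "keywords in union: %d\n",        unpack('%32b*', $union_ab);
```

This relies on Perl's classic string-wise `|` and `&`, so it assumes the `bitwise` feature is not enabled.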
Re^6: algorithm for 'best subsets' by tall_man (Parson) on Mar 04, 2005 at 22:05 UTC
It wouldn't be easy to accumulate the item numbers, because the data structure for UnionFind keeps changing as I go. At any given point, I can find out which partition a vertex belongs to by looking it up with "find". So the easiest way to get the partitions is to gather them up at the end.

However, I am still seeing a bug. Some of my "one-item" partitions seem to have shared keys with other partitions. Cases near the beginning of the item set (like "iaac") tend to have this problem. Trying to gather the items into partitions as I go along might help me find the bug.

The purpose of this pass is to find all the completely distinct partitions. There's no point in looking for n-ary unions among things that have no bits in common. So I would apply the original algorithm to each subset.

This pass will also help halley decide if the keywords are too generic. If it's all one big clump, there may be too many common words in the set.
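The "gather them up at the end" approach can be sketched as follows. This is a small hand-rolled union-find rather than Graph::UnionFind, and the sample `%items` data and the `item:`/`kw:` vertex prefixes are assumptions made for illustration. Every item is unioned with each of its keywords, so items that share a keyword end up under the same representative, and a final pass calls `find` on each item to collect the partitions.

```perl
use strict;
use warnings;

# Hypothetical sample data: item => list of keywords.
my %items = (
    i1 => [qw(red blue)],
    i2 => [qw(blue green)],
    i3 => [qw(yellow)],
    i4 => [qw(green)],
    i5 => [qw(yellow black)],
);

my %parent;    # union-find forest: vertex => parent vertex

# Return the representative of $v's set, compressing the path on the way.
sub find {
    my ($v) = @_;
    $parent{$v} = $v unless exists $parent{$v};
    my $root = $v;
    $root = $parent{$root} while $parent{$root} ne $root;
    while ( $parent{$v} ne $root ) {
        my $next = $parent{$v};
        $parent{$v} = $root;
        $v = $next;
    }
    return $root;
}

# Merge the sets containing $u and $v.
sub union {
    my ( $u, $v ) = @_;
    my ( $ru, $rv ) = ( find($u), find($v) );
    $parent{$ru} = $rv if $ru ne $rv;
}

# Link every item to each of its keywords.
for my $item ( sort keys %items ) {
    union( "item:$item", "kw:$_" ) for @{ $items{$item} };
}

# Gather the partitions at the end, keyed by each item's representative.
my %partition;
for my $item ( sort keys %items ) {
    push @{ $partition{ find("item:$item") } }, $item;
}

my @groups = sort { $a->[0] cmp $b->[0] } values %partition;
print join( ' ', @$_ ), "\n" for @groups;
```

With this sample data the items fall into two partitions, i1/i2/i4 and i3/i5. The representative changes as unions happen, which is why the membership lists are only built after all unions are done.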
Re^7: algorithm for 'best subsets' by BrowserUk (Patriarch) on Mar 04, 2005 at 23:27 UTC
It's definitely a very valuable pass to make. It cuts down the combinatorics immensely, especially (as you said) if you can remove the most common words from the picture.

I implemented it myself, because I couldn't work out how to accumulate the partition items on the fly using Graph::UnionFind--for good reason it seems :).

My homebrew implementation finds 141 partitions in the 17576 keywords / 676 items set in .5 seconds, 292 partitions in the 438697 keywords / 676 items set in 23 seconds, and 745 partitions in the 17576 keywords / 17576 items set in 18 seconds.

`[22:56:47.23] P:\test>436050-2 -W=3 -I=2 -NODETAILS
Keywords: 16873 Items: 676
141 partitons found.
[22:56:47.81] P:\test>436050-2 -W=4 -I=2 -NODETAILS
Keywords: 438697 Items: 676
292 partitons found.
[22:57:10.23] P:\test>436050-2 -W=3 -I=3 -NODETAILS
Keywords: 16873 Items: 17576
745 partitons found.
[22:57:28.48] P:\test>`

Which is good news because it means that it scales well in both directions. I tried the 438697K / 438697I combination, but even with avoiding the memory requirement of Graph::UnionFind's hashes, it still requires more memory than I have, and my disk is badly fragged so I am defragging it before trying again.

Once you have partitioned, would you agree that it makes sense to do pairwise combinations of the partitions and isolate those pairs within the partition that have nothing in common?
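For readers wondering what accumulating the partition items on the fly might look like, here is one possible sketch. It is not the homebrew implementation timed above, which is not shown in this thread. It keeps an explicit member list per partition and, on each merge, folds the smaller list into the larger. The names `merge`, `%root_of`, `%members` and `%first_with`, and the sample data, are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample data: item => list of keywords.
my %items = (
    i1 => [qw(red blue)],
    i2 => [qw(blue green)],
    i3 => [qw(yellow)],
    i4 => [qw(green)],
    i5 => [qw(yellow black)],
);

my %root_of;    # item => representative item of its partition
my %members;    # representative item => [ items in that partition ]

# Merge the partitions holding $x and $y, registering unseen items first.
# The smaller member list is folded into the larger one.
sub merge {
    my ( $x, $y ) = @_;
    for ( $x, $y ) {
        next if exists $root_of{$_};
        $root_of{$_} = $_;
        $members{$_} = [$_];
    }
    my ( $rx, $ry ) = @root_of{ $x, $y };
    return if $rx eq $ry;
    ( $rx, $ry ) = ( $ry, $rx ) if @{ $members{$rx} } < @{ $members{$ry} };
    $root_of{$_} = $rx for @{ $members{$ry} };
    push @{ $members{$rx} }, @{ delete $members{$ry} };
}

my %first_with;    # keyword => first item seen carrying it
for my $item ( sort keys %items ) {
    merge( $item, $item );    # make sure keyword-less items get a partition
    for my $kw ( @{ $items{$item} } ) {
        if   ( exists $first_with{$kw} ) { merge( $item, $first_with{$kw} ) }
        else                             { $first_with{$kw} = $item }
    }
}

# The member lists are already complete; no final lookup pass is needed.
my @groups = sort { $a->[0] cmp $b->[0] } map { [ sort @$_ ] } values %members;
print join( ' ', @$_ ), "\n" for @groups;
```

The trade-off is extra memory for the member lists and relabelling work on each merge, in exchange for having every partition's items available at any point during the pass.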
Re^8: algorithm for 'best subsets' by tall_man (Parson) on Mar 04, 2005 at 23:35 UTC
Re^9: algorithm for 'best subsets' by BrowserUk (Patriarch) on Mar 05, 2005 at 00:09 UTC
Re^9: algorithm for 'best subsets' by BrowserUk (Patriarch) on Mar 08, 2005 at 01:19 UTC


	PerlMonks