perlquestion
Dev Null
<p>Howdy Monks,</p>
<p>I have a set of objects, each of which has a collection of text attributes. I am trying to find groups of those objects with the most, and largest, sets of attributes in common. If I require all objects in a grouping to have all the attributes for it to count as a set, then clearly I'm just taking the intersection of the attribute sets, which is fairly simple. But I want to know about "partial sets" as well - I think an example will serve me best:</p>
<code>
"apple" => ("red", "round", "plant", "fruit")
"orange" => ("orange", "round", "plant", "fruit")
"pumpkin" => ("orange", "round", "plant", "vegetable")
"ball" => ("red", "round", "toy")
</code>
<p>EDIT: Updated this, as my original example was unclear.</p>
<p>For a group size of 3, I check the sets of each permutation of 3 objects:</p>
<code>
(apple, orange, pumpkin):
orange: 2
round: 3
plant: 3
fruit: 2
(apple, orange, ball):
red: 2
round: 3
plant: 2
fruit: 2
(apple, pumpkin, ball):
red: 2
round: 3
plant:2
(orange, pumpkin, ball):
orange: 2
round: 3
plant: 2
</code>
<p>(Ignoring sets of size 1 as uninteresting...) Again, this is not that hard to calculate for a given group. But what I'm trying to do is _find_ the groups with the most/largest sets (I'm provisionally using a scoring system which I _think_ is mostly irrelevant to my problem...) Brute force checking the score of all the combinations of 4 from my collection of 500-odd objects leaves me checking over fifty billion combinations; in addition to offending my sensibilities, this is kinda slow. So I'm hoping your Monkishnesses can help come up with an algorithm / search strategy for constructing sets to score that are more likely to be "interesting". Maybe something like:</p>
<p>For each pairing of two objects, find the intersection of the two attribute sets.</p>
<p>_For each pair intersection, ordered by the size of the intersect descending and above some minimum size, calculate the intersect with each other object</p>
<p>__For each triplet intersection, ordered by the size of the intersect descending and above some minimum size, calculate the score for the set of objects</p>
<p>(Just thought of that then. I'll have to go try it, though it is pretty fuzzy still.) The idea being that each time you cut off a block of intersections for being below a minimum size, you chop out a huge chunk of the search space. But since I now am not doing a complete search of the combination space, don't I run the risk of coming at the same combination from different directions? - like measuring both of "apple", "pumpkin", "ball" and "ball", "apple", "pumpkin - and doing the work twice.</p>
<p>
In any case, I appreciate any ideas, and apologize in advance if this is too much of an algorithm question that isn't related enough to perl specifically.</p>