Groups of Objects with Common Attributes

Dev Null has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Groups of Objects with Common Attributes by alexander_lunev (Pilgrim) on May 14, 2018 at 20:10 UTC
#!/usr/local/bin/perl -w use strict; my %things = ( "apple" => ["red", "round", "plant", "fruit"], "orange" => ["orange", "round", "plant", "fruit"], "pumpkin" => ["orange", "round", "plant", "vegetable"], "ball" => ["red", "round", "toy"], ); my %kinds; # # here we populate %kinds hash with kinds from %things, but of course # in real life we should do it while populating %things hash # foreach my $t (keys %things) { foreach my $k (@{$things{$t}}) { $kinds{$k}{$t} = 1; } } # search for "red round plant" my @search = split (' ',"red round plant"); my %kinds_slice; # # we can use slice here only because kinds are strings # if your kinds are not strings - well, it's another story # @kinds_slice{@search} = @kinds{@search}; my %things_sort; # # this is boring "count it" loop, but hey, # we're counting only %kinds_slice, not the whole/all %kinds! # foreach my $kind (keys %kinds_slice) { foreach my $thing (keys %{ $kinds_slice{$kind} }) { $things_sort{$thing}++; } } # # all things that match search, sorted # print "Matches:\n",join ("\n",map {"$_: $things_sort{$_}"} sort { $thi +ngs_sort{$b} <=> $things_sort{$a} } keys %things_sort),"\n\n"; # # or just get the first value to get best match # my @matches = sort { $things_sort{$b} <=> $things_sort{$a} } keys %thi +ngs_sort ; print "Best match: ".$matches[0]."\n Score: ".$things_sort{$matches[0] +}."\n"; [download]	[reply] [d/l]
Re^2: Groups of Objects with Common Attributes by martink (Initiate) on May 15, 2018 at 22:45 UTC
You can "deal" with this using hierarchical clustering, in the sense that it might narrow a large problem into a set of many smaller problems that you can brute force. First, for each text attribute you assign a dimension, so that your objects are n-vectors that look like, for example, `[1,0,0,1,0]`, if you had 5 text attributes and this object that the first and fourth attribute. Then you plug these entries into a hierarchical clustering algorithm. You can cut the tree at any level and look at how objects have been grouped. This will let you identify objects that are close together in this space (i.e. share many attributes). What I've done below is an experiment in which I first calculate the sets of all items for which pairwise distances are the same and then hierarchically cluster each set. What you can then do is look into each set and find all combinations and find their shared attributes. For example, the largest intersection is apple,orange,pumpkin which is plant,round. But within this set, orange,pumpkin share orange,plant,round - this doesn't get picked up by the clustering ;/ Here is some code that produces the following output, which I've trimmed for display ./pairs \| grep cut \| cut -d " " -f 4- \| sort -u \| sort -nr -k3 cluster,items,attr 0 3 2 apple,orange,pumpkin plant,round cluster,items,attr 0 3 1 ball,orange,pumpkin round cluster,items,attr 0 3 1 apple,ball,pumpkin round cluster,items,attr 1 2 2 apple,pumpkin plant,round cluster,items,attr 1 2 1 ball,pumpkin round cluster,items,attr 2 1 4 pumpkin orange,plant,round,vegetable cluster,items,attr 1 1 4 apple fruit,plant,red,round cluster,items,attr 1 1 3 ball red,round,toy cluster,items,attr 0 1 4 orange fruit,orange,plant,round cluster,items,attr 0 1 4 apple fruit,plant,red,round [download] A bit messy: use List::Util (sum); use List::MoreUtils qw(uniq); use Algorithm::Cluster; my $data = { apple => [qw( red round plant fruit)], orange => [qw(orange round plant fruit)], pumpkin => [qw(orange round plant vegetable)], ball => [qw( red round toy)], }; # list of all attributes my @items = sort keys %$data; my @attr = sort(uniq( map { @{$data->{$_}} } keys %$data)); # data set recast as vectors my $datav; for my $item (@items) { my $vec; for my $attr (@attr) { push @$vec, scalar grep($_ eq $attr, @{$data->{$item}}); } push @$datav, $vec; } # keep track of all item sets for which items had the same distance my $scores; for my $i (0..@items-1) { for my $j ($i+1..@items-1) { my $score = pair_score($datav->[$i],$datav->[$j]); push @{$scores->{$score}}, ($i,$j); } } # now go through the sets of items with # same scores and hierarchically cluster them # based on a distance matrix generated by the score function for my $score (sort {$b <=> $a} keys %$scores) { my @item_idx = uniq(@{$scores->{$score}}); printf("score %d items %s\n",$score,join(",",@items[@item_idx])); cluster_and_report(@item_idx); } # distances matrix sub distance_matrix { my @item_idx = @_; my $distances; for my $i (0..@item_idx-1) { push @{$distances},[]; for my $j (0..$i-1) { push @{$distances->[-1]}, pair_score( $datav->[$item_idx[$i]],$d +atav->[$item_idx[$j]] ); } } return $distances; } # decide how to score a pair of items sub pair_score { my ($x,$y) = @_; my $score = 0; for my $i (0..@$x-1) { if($x->[$i] == $y->[$i]) { $score += $x->[$i]; # +1 score for a shared attribute } else { #$score--; # potential penalty for unshared attributes } } return $score; } sub cluster_and_report { my @item_idx = @_; my $zerov = [ map { 0 } @attr ]; my %param = ( data => distance_matrix(@item_idx), mask => [ map { $zerov } @item_idx ], weight => [ map { 1 } @attr ], transpose => 0, dist => "e", method => "s", ); my $tree = Algorithm::Cluster::treecluster(%param); for my $cut_level (1..int(@item_idx)) { my ($clusters) = $tree->cut($cut_level); #printdumper($clusters); my @cluster_ids = uniq(@$clusters); for my $cluster_id (@cluster_ids) { my @cluster_item_idx = map { $item_idx[$_] } grep($clusters->[$_ +] == $cluster_id, (0..@item_idx-1)); my @shared_vector = shared_vector( map { $datav->[$_] } @clus +ter_item_idx); my @shared_attr = map { $attr[$_] } grep($shared_vector[$_] +, (0..@shared_vector-1)); printinfo(sprintf("cut level %d cluster,items,attr %d %d %d %s % +s", $cut_level, $cluster_id, int(@cluster_item_idx), int(@shared_attr), join(",",@items[@cluster_item_idx]), join(",",@shared_attr))); } } } # use the shared attributes as a string to find sets of # items that share same attribute sub shared_vector { my @datav = @_; my $shared; for my $i (0..@{$datav[0]}-1) { if(grep($_->[$i] == 1, @datav) == @datav) { push @$shared, 1; } else { push @$shared, 0; } } return @$shared; } [download]	[reply] [d/l] [select]
Re^2: Groups of Objects with Common Attributes by Dev Null (Acolyte) on May 14, 2018 at 21:54 UTC
Thank you for your reply Alexander! Unfortunately, I don't think I was very clear in defining the problem, for which I apologize. You have counted up the number of occurrences of all the attributes, and returned the attribute with the highest number of hits. What I'm trying to do is feed it the entire list of objects (~500) and a size for a group (say, 4) and have it return for me the group or groups of 4 distinct objects that have the most attributes in common. So for my sample data from earlier, and a group size of 3, we'd look at all the permutations of 3 objects: `(apple, orange, pumpkin): orange: 2 round: 3 plant: 3 fruit: 2 (apple, orange, ball): red: 2 round: 3 plant: 2 fruit: 2 (apple, pumpkin, ball): red: 2 round: 3 plant:2 (orange, pumpkin, ball): orange: 2 round: 3 plant: 2` [download] (Ignoring sets of size 1 as uninteresting...) And then, depending on how we choose to score them, say that (apple, orange, pumpkin) was the best set, and (apple, orange, ball) next best.	[reply] [d/l]
Re: Groups of Objects with Common Attributes by hdb (Monsignor) on May 15, 2018 at 07:23 UTC
Just some odd thoughts rather than a clear strategy. Say you have 75 attributes and 500 objects. Define a measure of distance between any pair of objects, e.g. 75 minus the number of common objects or 1 over the number of common objects. Calculate all distances between pairs. This gives you 250*499 distances. Apply some kind of clustering algorithm, e.g. nearest neighbors or something. These should be available on CPAN. Playing with distance measures and clustering algorithms (and their parameters) will lead you to a number of different solutions.	[reply]
Re^2: Groups of Objects with Common Attributes by bliako (Monsignor) on May 15, 2018 at 10:18 UTC
Continuing on what hdb suggested, Each object lives in a space where the said Attributes are the coordinates. Coordinates can be discreet or binary or continuous, e.g. binary: Apple -> (orange=0, red=1, plant=1, fruit=1,toy=0). Continuous means that there is so much probability that apple is orange (0.5) or apple is a toy (0.01). etc. Notice that each object is characterised by a set of coordinates which includes ALL attributes. If attribute does not relate to object then it is set to zero. In this space, there is a distance metric, e.g. euclidean (others exist) to tell you how far apart are an Apple and a pumpkin. Clustering is a process which groups nearby objects (in this space) together based on the distance metric chosen. I see a problem with the above approach: most objects will have most of their coordinates set to zero, e.g. an Apple has only 3 attributes turned on and I guess the rest 72 will be off/zero. The problem is that clustering may group together objects because they have in common the absence of a lot of attributes, and you probably want to group objects together because they have in common the presence of an attribute. An obstacle but not a tough one.	[reply]
Re: Groups of Objects with Common Attributes by LanX (Saint) on May 15, 2018 at 11:30 UTC
k-means clustering ? --> Algorithm::KMeans update With aforementioned module you can even choose the range of permitted clusters `# For very large data files, setting K to 0 will result in searching + through too # many values for K. For such cases, you can range limit the values + of K to search # through by my $clusterer = Algorithm::KMeans->new( datafile => $datafile, mask => "N111", Kmin => 3, Kmax => 10, cluster_seeding => 'random', + # or 'smart' terminal_output => 1, write_clusters_to_files => 1 +, );` [download] Cheers Rolf _{(addicted to the Perl Programming Language and ☆☆☆☆ :) Wikisyntax for the Monastery}	[reply] [d/l]
Re: Groups of Objects with Common Attributes by martink (Initiate) on May 16, 2018 at 05:15 UTC
I found a better way than the hierarchical clustering. The code below generates a line for each combination: (a) count of attributes, (b) count of items, (c) attribute list (d) item list. I pass it through sort -u because there may be duplication. `1 4 round apple,ball,orange,pumpkin 2 2 red,round apple,ball 2 3 plant,round apple,orange,pumpkin 3 1 red,round,toy ball 3 2 fruit,plant,round apple,orange 3 2 orange,plant,round orange,pumpkin 4 1 fruit,orange,plant,round orange 4 1 fruit,plant,red,round apple 4 1 orange,plant,round,vetetable pumpkin` [download] The lines that trivially report a single item and all its attributes would come into play if you had multiple items with the same attribute set. The code is below. Basically what happens is that for the item2attr hash, I look at all unique attribute lists across items and then report the items that have this set in common. This comes into its own when you flip the hash and report the same thing, but flip the role of item and attribute. In the end you get a full list. I hope that you find this useful and that it does what I think it does :) my $item2attr = { apple => {red=>1, round=>1,plant=>1,fruit=>1}, orange => {orange=>1,round=>1,plant=>1,fruit=>1}, pumpkin => {orange=>1,round=>1,plant=>1, vetetable= +>1}, ball => {red=>1, round=>1, + toy=>1}, }; ## alternatively in the block below, generate a random data set with ## 500 items and 75 attributes with randomly 2-10 attributes per item =pod my $n_items = 500; my $n_attributes = 75; my $min_attr_in_item = 2; my $max_attr_in_item = 10; $item2attr = {}; for my $i (1..$n_items) { my $item = sprintf("it%03d",$i); my $n_attr = $min_attr_in_item + rand(1+$max_attr_in_item-$min_attr_ +in_item) ; my @attrs = sort ((map { sprintf("at%03d",$_) } (sort {rand() <=> r +and() } (1..75)))[0..$n_attr-1]); #printinfo($item,int(@attrs),@attrs); map {$item2attr->{$item}{$_} = 1} @attrs; } =cut # list of all items and attributes my @items = sort keys %$item2attr; my @attr = sort(uniq( map { keys %$_ } values %$item2attr)); # flip the hash my $attr2item; for my $attr (@attr) { map { $attr2item->{$attr}{$_} = $item2attr->{$_}{$attr} if $item2at +tr->{$_}{$attr} } @items; } report_sets($item2attr); report_sets($attr2item,-swap=>1); sub report_sets { my ($hash,%args) = @_; my $sets; for my $key (keys %$hash) { my $set_hash_str = join(",", sort keys %{$hash->{$key}}); $sets->{$set_hash_str}{$key}++; } for my $set_hash_str (keys %$sets) { my @attr = split(",",$set_hash_str); my @shared_attr = shared_items($hash,@attr); if($args{-swap}) { printinfo(int(@shared_attr),int(@attr),join(",",@shared_attr),jo +in(",",@attr)); } else { printinfo(int(@attr),int(@shared_attr),join(",",@attr),join(",", +@shared_attr)); } } } sub shared_items { my ($hash,@attr) = @_; my @shared_items; my @items = keys %$hash; for my $item (@items) { my $n = grep($hash->{$item}{$_}, @attr); push @shared_items, $item if $n == @attr; } return sort @shared_items; } [download]	[reply] [d/l] [select]
Re^2: Groups of Objects with Common Attributes by vr (Curate) on May 17, 2018 at 17:22 UTC
After couple days of looking up what on earth is clustering, k-means, etc. -- hopeless -- I took closer look at your code. Wow, so simple. Thanks, martink! Here is re-factored version, if I may, discarding all that was perceived superfluous and simplifying (to fit my brain). So it's, effectively, just 2 plain loops: over all attributes and all items. Loop over items doesn't add to example with pumpkins, but is required for other test cases. I wonder, is it a mathematical fact, that even for 500 items and 75 attributes, there can be no more than 575 sets of common attributes? It somewhat contradicts to what I remember from combinatorics. use strict; use warnings; use feature 'say'; use List::Util qw/ uniq all /; use Data::Dump 'dd'; my $item2attr = { apple => { red => 1, round => 1, plant => 1, fruit => 1 } +, orange => { orange => 1, round => 1, plant => 1, fruit => 1 } +, pumpkin => { orange => 1, round => 1, plant => 1, vegetable => 1 } +, ball => { red => 1, round => 1, toy => 1 }, }; # list of all items and attributes my @items = sort keys %$item2attr; my @attr = sort( uniq( map { keys %$_ } values %$item2attr )); # flip the hash my $attr2item; for my $attr ( @attr ) { for ( @items ) { $attr2item-> { $attr }{ $_ } = 1 if $item2attr-> { $_ }{ $attr } } } #dd $item2attr; #say '-----------------------------------'; #dd $attr2item; #say '-----------------------------------'; my %solutions; # hash, to prevent duplicates for ( @attr ) { my @items_ = keys %{ $attr2item-> { $_ }}; my @attr_ = grep { my $attr = $_; all { $item2attr-> { $_ }{ $attr }} @items_ } @attr; _add_solution( \@attr_, \@items_ ) } for ( @items ) { my @attr_ = keys %{ $item2attr-> { $_ }}; my @items_ = grep { my $item = $_; all { $attr2item-> { $_ }{ $item }} @attr_ } @items; _add_solution( \@attr_, \@items_ ) } dd values %solutions; # then filter solutions for required number of common # attributes, or find max set of common attributes, # or find max set of items with any common attributes, etc. sub _add_solution { # writes to %solutions my ( $attr, $items ) = @_; return unless $#$items; # skip uninteresting @$_ = sort @$_ for @_; $solutions{ join ',', @$attr } = [ scalar @$attr, # count of attributes scalar @$items, # count of items $attr, # attribute list $items # item list ] } __END__ ( [2, 2, ["red", "round"], ["apple", "ball"]], [2, 3, ["plant", "round"], ["apple", "orange", "pumpkin"]], [1, 4, ["round"], ["apple", "ball", "orange", "pumpkin"]], [3, 2, ["fruit", "plant", "round"], ["apple", "orange"]], [3, 2, ["orange", "plant", "round"], ["orange", "pumpkin"]], ) [download] Edit: fixed issue with sorting.	[reply] [d/l]
Re: Groups of Objects with Common Attributes by Anonymous Monk on May 14, 2018 at 19:51 UTC
ok, 500 objects, but how many unique attributes?	[reply]
Re^2: Groups of Objects with Common Attributes by Dev Null (Acolyte) on May 14, 2018 at 21:37 UTC
About 75 unique attributes. Most objects have between 2 and 7 attributes. So the "attribute denisty" - if that made-up term conveys what I intend - is fairly low.	[reply]
Re^3: Groups of Objects with Common Attributes by Anonymous Monk on May 15, 2018 at 16:43 UTC
Most objects have between 2 and 7 attributes. With 7 attributes, there are but 2**7 == 128 subsets to consider; and no attribute group with more that 7 attributes. You could enumerate all subsets of each object. That might total to a few thousand unique attribute groups for the whole problem. In other words, with size and density so low, you can easily afford to construct the powerset of attributes. This way you can transform the 500 by 75 into 500 by few-thousands, search and sort by cardinality or weight, etc.	[reply]
Re: Groups of Objects with Common Attributes by Anonymous Monk on May 15, 2018 at 15:39 UTC
Ah. Another Data mining question. Please note that your fundamental problem, set cover optimization, is known to be NP-hard. That said, take a look at Comparing two arrays and other posts by same OP. 500 by 75 sounds too small to be a real-life problem. But if it is, could you please share with us the focus area of your research?	[reply]
Re^2: Groups of Objects with Common Attributes by Dev Null (Acolyte) on May 15, 2018 at 17:08 UTC
Heh. "Research." I was playing a tablet game the other day (Gems of War, if anyone really cares), and the question of how to write a program to optimize team synergies intrigued me. It's not even an important part of the game - it was just the bit that interested me. I wrote the quick-and-dirty brute force solution but it's slow and aesthetically unpleasing, so I started poking around at better solutions - it seemed like the sort of thing that would be a known problem, and that you folks would know about. I only only went to the trouble of abstracting my problem description because I didn't want to have to explain the game mechanics, and because I thought it might make it easier to recognize the general class of problem. Thanks everyone for the feedback! As a hobby project, it may take me a day or two to have time to work on it again, but I'll definitely pop back in to let you know how it goes.	[reply]

Groups of Objects with Common Attributes

update