
Re: Groups of Objects with Common Attributes

by martink (Initiate)
on May 16, 2018 at 05:15 UTC


in reply to Groups of Objects with Common Attributes

I found a better way than the hierarchical clustering.

The code below prints one line for each combination, with four fields: (a) count of attributes, (b) count of items, (c) attribute list, (d) item list.

I pass it through sort -u because there may be duplication.

1 4 round apple,ball,orange,pumpkin
2 2 red,round apple,ball
2 3 plant,round apple,orange,pumpkin
3 1 red,round,toy ball
3 2 fruit,plant,round apple,orange
3 2 orange,plant,round orange,pumpkin
4 1 fruit,orange,plant,round orange
4 1 fruit,plant,red,round apple
4 1 orange,plant,round,vegetable pumpkin

The lines that trivially report a single item and all its attributes would come into play if you had multiple items with the same attribute set.
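To illustrate that case, here is a minimal sketch (not the full code from this post) that groups items by their complete attribute list. The item cherry is a hypothetical addition, assumed to have exactly the same attributes as apple, and the names %items_by_attr_set and @lines are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical data: 'cherry' is an assumed extra item that has exactly
# the same attributes as 'apple'.
my %item2attr = (
  apple  => { red => 1, round => 1, plant => 1, fruit => 1 },
  cherry => { red => 1, round => 1, plant => 1, fruit => 1 },
  ball   => { red => 1, round => 1, toy => 1 },
);

# Group items by their full attribute list, using the sorted,
# comma-joined attribute names as the grouping key.
my %items_by_attr_set;
for my $item (sort keys %item2attr) {
  my $key = join ",", sort keys %{ $item2attr{$item} };
  push @{ $items_by_attr_set{$key} }, $item;
}

# One line per attribute set: attribute count, item count, attributes, items.
my @lines;
for my $key (sort keys %items_by_attr_set) {
  my @attrs = split /,/, $key;
  my @items = @{ $items_by_attr_set{$key} };
  push @lines, join " ", scalar(@attrs), scalar(@items), $key, join(",", @items);
}
print "$_\n" for @lines;
# prints:
# 4 2 fruit,plant,red,round apple,cherry
# 3 1 red,round,toy ball
```

With this data, the line for the four-attribute set no longer reports a single item: apple and cherry share it, so it becomes a real group.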

The code is below. Basically what happens is that for the item2attr hash, I look at all unique attribute lists across items and then report the items that have this set in common. This comes into its own when you flip the hash and report the same thing, but flip the role of item and attribute. In the end you get a full list.

I hope that you find this useful and that it does what I think it does :)

use strict;
use warnings;
use List::Util qw(uniq);   # uniq() is used below

my $item2attr = {
  apple   => { red => 1,    round => 1, plant => 1, fruit => 1 },
  orange  => { orange => 1, round => 1, plant => 1, fruit => 1 },
  pumpkin => { orange => 1, round => 1, plant => 1, vegetable => 1 },
  ball    => { red => 1,    round => 1, toy => 1 },
};

## alternatively in the block below, generate a random data set with
## 500 items and 75 attributes with randomly 2-10 attributes per item

=pod

my $n_items          = 500;
my $n_attributes     = 75;
my $min_attr_in_item = 2;
my $max_attr_in_item = 10;
$item2attr = {};
for my $i (1..$n_items) {
  my $item   = sprintf("it%03d",$i);
  my $n_attr = $min_attr_in_item + rand(1+$max_attr_in_item-$min_attr_in_item);
  my @attrs  = sort ((map { sprintf("at%03d",$_) } (sort {rand() <=> rand() } (1..$n_attributes)))[0..$n_attr-1]);
  #printinfo($item,int(@attrs),@attrs);
  map {$item2attr->{$item}{$_} = 1} @attrs;
}

=cut

# list of all items and attributes
my @items = sort keys %$item2attr;
my @attr  = sort(uniq( map { keys %$_ } values %$item2attr));

# flip the hash
my $attr2item;
for my $attr (@attr) {
  map { $attr2item->{$attr}{$_} = $item2attr->{$_}{$attr} if $item2attr->{$_}{$attr} } @items;
}

report_sets($item2attr);
report_sets($attr2item,-swap=>1);

# For each unique key set in the hash, report everything that shares that set.
sub report_sets {
  my ($hash,%args) = @_;
  my $sets;
  for my $key (keys %$hash) {
    my $set_hash_str = join(",", sort keys %{$hash->{$key}});
    $sets->{$set_hash_str}{$key}++;
  }
  for my $set_hash_str (keys %$sets) {
    my @attr        = split(",",$set_hash_str);
    my @shared_attr = shared_items($hash,@attr);
    if($args{-swap}) {
      printinfo(int(@shared_attr),int(@attr),join(",",@shared_attr),join(",",@attr));
    } else {
      printinfo(int(@attr),int(@shared_attr),join(",",@attr),join(",",@shared_attr));
    }
  }
}

# Return the sorted top-level keys of the hash that have every element of @attr.
sub shared_items {
  my ($hash,@attr) = @_;
  my @shared_items;
  my @items = keys %$hash;
  for my $item (@items) {
    my $n = grep($hash->{$item}{$_}, @attr);
    push @shared_items, $item if $n == @attr;
  }
  return sort @shared_items;
}

# printinfo() was not shown in the original post. This stand-in is an
# assumption: it prints its arguments space-separated, matching the output above.
sub printinfo {
  print join(" ", @_), "\n";
}

Re^2: Groups of Objects with Common Attributes
by vr (Curate) on May 17, 2018 at 17:22 UTC

    After a couple of days of looking up what on earth clustering, k-means, etc. are -- hopeless -- I took a closer look at your code. Wow, so simple. Thanks, martink! Here is a refactored version, if I may, discarding everything I perceived as superfluous and simplifying (to fit my brain). So it's effectively just 2 plain loops: one over all attributes and one over all items. The loop over items doesn't add anything to the example with pumpkins, but is required for other test cases.

    I wonder, is it a mathematical fact that, even for 500 items and 75 attributes, there can be no more than 575 sets of common attributes? It somewhat contradicts what I remember from combinatorics.

    use strict;
    use warnings;
    use feature 'say';
    use List::Util qw/ uniq all /;
    use Data::Dump 'dd';

    my $item2attr = {
        apple   => { red => 1,    round => 1, plant => 1, fruit => 1 },
        orange  => { orange => 1, round => 1, plant => 1, fruit => 1 },
        pumpkin => { orange => 1, round => 1, plant => 1, vegetable => 1 },
        ball    => { red => 1,    round => 1, toy => 1 },
    };

    # list of all items and attributes
    my @items = sort keys %$item2attr;
    my @attr  = sort( uniq( map { keys %$_ } values %$item2attr ));

    # flip the hash
    my $attr2item;
    for my $attr ( @attr ) {
        for ( @items ) {
            $attr2item-> { $attr }{ $_ } = 1
                if $item2attr-> { $_ }{ $attr }
        }
    }

    #dd $item2attr;
    #say '-----------------------------------';
    #dd $attr2item;
    #say '-----------------------------------';

    my %solutions; # hash, to prevent duplicates

    for ( @attr ) {
        my @items_ = keys %{ $attr2item-> { $_ }};
        my @attr_ = grep {
            my $attr = $_;
            all { $item2attr-> { $_ }{ $attr }} @items_
        } @attr;
        _add_solution( \@attr_, \@items_ )
    }

    for ( @items ) {
        my @attr_ = keys %{ $item2attr-> { $_ }};
        my @items_ = grep {
            my $item = $_;
            all { $attr2item-> { $_ }{ $item }} @attr_
        } @items;
        _add_solution( \@attr_, \@items_ )
    }

    dd values %solutions;

    # then filter solutions for required number of common
    # attributes, or find max set of common attributes,
    # or find max set of items with any common attributes, etc.

    sub _add_solution { # writes to %solutions
        my ( $attr, $items ) = @_;
        return unless $#$items; # skip uninteresting
        @$_ = sort @$_ for @_;
        $solutions{ join ',', @$attr } = [
            scalar @$attr,   # count of attributes
            scalar @$items,  # count of items
            $attr,           # attribute list
            $items           # item list
        ]
    }

    __END__

    (
      [2, 2, ["red", "round"], ["apple", "ball"]],
      [2, 3, ["plant", "round"], ["apple", "orange", "pumpkin"]],
      [1, 4, ["round"], ["apple", "ball", "orange", "pumpkin"]],
      [3, 2, ["fruit", "plant", "round"], ["apple", "orange"]],
      [3, 2, ["orange", "plant", "round"], ["orange", "pumpkin"]],
    )

    Edit: fixed issue with sorting.
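On the 575 question: as far as I can tell, that ceiling comes from the method rather than from combinatorics. Each pass of the two loops stores at most one solution, so the result can never exceed (number of attributes + number of items). A brute-force check over every attribute subset can find more groups. The sketch below is only a probe, not part of either program above. It assumes a hypothetical data set of four items, each missing exactly one of four attributes, and its helper names (items_with, attrs_of, add_group) are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical data set: four items, each lacking exactly one of four attributes.
my %item2attr = (
  i1 => { b => 1, c => 1, d => 1 },
  i2 => { a => 1, c => 1, d => 1 },
  i3 => { a => 1, b => 1, d => 1 },
  i4 => { a => 1, b => 1, c => 1 },
);
my @items = sort keys %item2attr;
my @attr  = qw(a b c d);

# Items that have every attribute in the given list.
sub items_with {
  my @want = @_;
  return grep { my $it = $_; !grep { !$item2attr{$it}{$_} } @want } @items;
}

# Attributes shared by every item in the given list.
sub attrs_of {
  my @have = @_;
  return grep { my $at = $_; !grep { !$item2attr{$_}{$at} } @have } @attr;
}

# Record the group reached from an attribute subset, keyed by the full set of
# attributes its items share; groups with fewer than two items are skipped.
sub add_group {
  my ($found, @subset) = @_;
  my @group = items_with(@subset);
  return if @group < 2;
  $found->{ join ",", attrs_of(@group) } = [@group];
}

my (%two_loops, %brute_force);

# The two loops: start from each single attribute, then from each item's
# complete attribute list.
add_group(\%two_loops, $_) for @attr;
add_group(\%two_loops, sort keys %{ $item2attr{$_} }) for @items;

# Brute force: start from every non-empty subset of attributes.
my $n_subsets = 2 ** scalar(@attr);
for my $mask (1 .. $n_subsets - 1) {
  my @subset = map { $attr[$_] } grep { $mask & (1 << $_) } 0 .. $#attr;
  add_group(\%brute_force, @subset);
}

printf "two loops: %d groups, brute force: %d groups\n",
  scalar(keys %two_loops), scalar(keys %brute_force);
# prints: two loops: 4 groups, brute force: 10 groups
```

Here the pair a,b is shared by i3 and i4, but neither loop starts from a point that reaches it. Brute force over all subsets is exponential in the number of attributes, so it is only usable as a check on small inputs.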
