
After a couple of days of looking up what on earth clustering, k-means, etc. are (hopeless), I took a closer look at your code. Wow, so simple. Thanks, martink! Here is a refactored version, if I may, which discards everything I perceived as superfluous and simplifies the rest (to fit my brain). Effectively, it's just two plain loops: one over all attributes and one over all items. The loop over items adds nothing for the pumpkin example, but it is required for other test cases.

I wonder: is it a mathematical fact that, even for 500 items and 75 attributes, there can be no more than 575 sets of common attributes (500 + 75, at most one per loop iteration)? That seems to contradict what I remember from combinatorics.
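
One rough way to probe that question is a brute-force count on a tiny data set. The sketch below is not from the thread: the data (`%lacks_one`) and the helper `brute_force_groups` are made up for illustration, and it tries every non-empty subset of attributes, so it is only usable for a handful of attributes.

```perl
use strict;
use warnings;

# Made-up data, not from the thread: 4 items, 4 attributes,
# each item lacking exactly one attribute.
my %lacks_one = (
    i1 => { b => 1, c => 1, d => 1 },
    i2 => { a => 1, c => 1, d => 1 },
    i3 => { a => 1, b => 1, d => 1 },
    i4 => { a => 1, b => 1, c => 1 },
);

# Hypothetical helper: brute force over every non-empty attribute subset.
# For each subset, find the items having all of its attributes; if there
# are at least two, record the full set of attributes those items share.
sub brute_force_groups {
    my ($data) = @_;
    my @items = sort keys %$data;
    my %seen;
    $seen{$_} = 1 for map { keys %$_ } values %$data;
    my @attr = sort keys %seen;
    my $n    = @attr;

    my %groups;    # keyed by attribute list, to prevent duplicates
    for my $mask ( 1 .. 2**$n - 1 ) {
        my @subset = map { $attr[$_] } grep { $mask & ( 1 << $_ ) } 0 .. $n - 1;

        my @members = grep {
            my $item = $_;
            !grep { !$data->{$item}{$_} } @subset;
        } @items;
        next if @members < 2;

        my @common = grep {
            my $name = $_;
            !grep { !$data->{$_}{$name} } @members;
        } @attr;
        $groups{ join ',', @common } = \@members;
    }
    return \%groups;
}

my $groups = brute_force_groups( \%lacks_one );
printf "%d groups for %d items and 4 attributes\n",
    scalar keys %$groups, scalar keys %lacks_one;
```

On this data the brute force finds 10 groups (4 single attributes plus 6 pairs), which is more than 4 + 4 = 8, so the items-plus-attributes bound does not look like a general fact. By my reading, the two loops in the script below would report only the 4 single-attribute groups here, because they start from single attributes and single items only.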

use strict;
use warnings;
use feature 'say';
use List::Util qw/ uniq all /;
use Data::Dump 'dd';

my $item2attr = {
    apple   => { red    => 1, round => 1, plant => 1, fruit     => 1 },
    orange  => { orange => 1, round => 1, plant => 1, fruit     => 1 },
    pumpkin => { orange => 1, round => 1, plant => 1, vegetable => 1 },
    ball    => { red    => 1, round => 1, toy   => 1 },
};

# list of all items and attributes
my @items = sort keys %$item2attr;
my @attr  = sort( uniq( map { keys %$_ } values %$item2attr ));

# flip the hash 
my $attr2item;
for my $attr ( @attr ) {
    for ( @items ) {
        $attr2item-> { $attr }{ $_ } = 1
            if $item2attr-> { $_ }{ $attr }
    }
}
    
#dd $item2attr;
#say '-----------------------------------';
#dd $attr2item;
#say '-----------------------------------';

my %solutions;      # hash, to prevent duplicates

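# attribute loop: take the items having this attribute,
# then collect every attribute shared by all of those items
for ( @attr ) {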
    my @items_ = keys %{ $attr2item-> { $_ }};
    
    my @attr_ = grep { 
        my $attr = $_;
        all { $item2attr-> { $_ }{ $attr }} @items_
    } @attr;

    _add_solution( \@attr_, \@items_ )
}

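# item loop: take the attributes of this item,
# then collect every item having all of those attributes
for ( @items ) {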
    my @attr_ = keys %{ $item2attr-> { $_ }};
    
    my @items_ = grep { 
        my $item = $_;
        all { $attr2item-> { $_ }{ $item }} @attr_
    } @items;

    _add_solution( \@attr_, \@items_ )
}

dd values %solutions;

# then filter solutions for required number of common 
# attributes, or find max set of common attributes,
# or find max set of items with any common attributes, etc.

sub _add_solution {             # writes to %solutions
    my ( $attr, $items ) = @_;
    
    return unless @$items > 1;  # skip groups of fewer than two items
    @$_ = sort @$_ for @_;

    $solutions{ join ',', @$attr } = [ 
        scalar @$attr,          # count of attributes
        scalar @$items,         # count of items
        $attr,                  # attribute list
        $items                  # item list
    ]
}

__END__

(
  [2, 2, ["red", "round"], ["apple", "ball"]],
  [2, 3, ["plant", "round"], ["apple", "orange", "pumpkin"]],
  [1, 4, ["round"], ["apple", "ball", "orange", "pumpkin"]],
  [3, 2, ["fruit", "plant", "round"], ["apple", "orange"]],
  [3, 2, ["orange", "plant", "round"], ["orange", "pumpkin"]],
)
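
The filtering step mentioned in the comments could look roughly like this. It is a sketch that assumes the records keep the `[ attribute count, item count, attributes, items ]` layout shown in the output above; the helper names `with_min_attr`, `max_common_attr` and `max_items` are made up for illustration and are not part of the script.

```perl
use strict;
use warnings;
use List::Util qw/ max /;

# Records copied from the output above:
# [ count of attributes, count of items, \@attributes, \@items ]
my @solutions = (
    [ 2, 2, [qw/ red round /],          [qw/ apple ball /] ],
    [ 2, 3, [qw/ plant round /],        [qw/ apple orange pumpkin /] ],
    [ 1, 4, [qw/ round /],              [qw/ apple ball orange pumpkin /] ],
    [ 3, 2, [qw/ fruit plant round /],  [qw/ apple orange /] ],
    [ 3, 2, [qw/ orange plant round /], [qw/ orange pumpkin /] ],
);

# Hypothetical helper: keep groups sharing at least $min attributes.
sub with_min_attr {
    my ( $min, @records ) = @_;
    return grep { $_->[0] >= $min } @records;
}

# Hypothetical helper: groups with the largest set of common attributes.
sub max_common_attr {
    my @records = @_;
    return unless @records;
    my $best = max( map { $_->[0] } @records );
    return grep { $_->[0] == $best } @records;
}

# Hypothetical helper: groups containing the most items.
sub max_items {
    my @records = @_;
    return unless @records;
    my $best = max( map { $_->[1] } @records );
    return grep { $_->[1] == $best } @records;
}

# Example: list the items of each largest group.
print join( ',', @{ $_->[3] } ), "\n" for max_items(@solutions);
```

In the script itself the same helpers could be called on `values %solutions` instead of the copied list.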

Edit: fixed issue with sorting.

In reply to Re^2: Groups of Objects with Common Attributes by vr
in thread Groups of Objects with Common Attributes by Dev Null
