P is for Practical PerlMonks

### comment on

 Need Help??

You can "deal" with this using hierarchical clustering, in the sense that it might narrow a large problem into a set of many smaller problems that you can brute force.

First, for each text attribute you assign a dimension, so that your objects are n-vectors that look like, for example, [1,0,0,1,0], if you had 5 text attributes and this object that the first and fourth attribute.

Then you plug these entries into a hierarchical clustering algorithm. You can cut the tree at any level and look at how objects have been grouped. This will let you identify objects that are close together in this space (i.e. share many attributes).

What I've done below is an experiment in which I first calculate the sets of all items for which pairwise distances are the same and then hierarchically cluster each set.

What you can then do is look into each set and find all combinations and find their shared attributes. For example, the largest intersection is apple,orange,pumpkin which is plant,round. But within this set, orange,pumpkin share orange,plant,round - this doesn't get picked up by the clustering ;/

Here is some code that produces the following output, which I've trimmed for display

```./pairs | grep cut | cut -d " " -f 4- | sort -u | sort -nr -k3
cluster,items,attr 0 3 2 apple,orange,pumpkin plant,round
cluster,items,attr 0 3 1 ball,orange,pumpkin round
cluster,items,attr 0 3 1 apple,ball,pumpkin round
cluster,items,attr 1 2 2 apple,pumpkin plant,round
cluster,items,attr 1 2 1 ball,pumpkin round
cluster,items,attr 2 1 4 pumpkin orange,plant,round,vegetable
cluster,items,attr 1 1 4 apple fruit,plant,red,round
cluster,items,attr 1 1 3 ball red,round,toy
cluster,items,attr 0 1 4 orange fruit,orange,plant,round
cluster,items,attr 0 1 4 apple fruit,plant,red,round

A bit messy:

```use List::Util (sum);
use List::MoreUtils qw(uniq);
use Algorithm::Cluster;

my \$data = {
apple   => [qw(       red round plant fruit)],
orange  => [qw(orange     round plant fruit)],
pumpkin => [qw(orange     round plant vegetable)],
ball    => [qw(       red round                 toy)],
};

# list of all attributes
my @items = sort keys %\$data;
my @attr = sort(uniq( map { @{\$data->{\$_}} } keys %\$data));

# data set recast as vectors
my \$datav;
for my \$item (@items) {
my \$vec;
for my \$attr (@attr) {
push @\$vec, scalar grep(\$_ eq \$attr, @{\$data->{\$item}});
}
push @\$datav, \$vec;
}

# keep track of all item sets for which items had the same distance
my \$scores;
for my \$i (0..@items-1) {
for my \$j (\$i+1..@items-1) {
my \$score = pair_score(\$datav->[\$i],\$datav->[\$j]);
push @{\$scores->{\$score}}, (\$i,\$j);
}
}

# now go through the sets of items with
# same scores and hierarchically cluster them
# based on a distance matrix generated by the score function
for my \$score (sort {\$b <=> \$a} keys %\$scores) {
my @item_idx = uniq(@{\$scores->{\$score}});
printf("score %d items %s\n",\$score,join(",",@items[@item_idx]));
cluster_and_report(@item_idx);
}

# distances matrix
sub distance_matrix {
my @item_idx = @_;
my \$distances;
for my \$i (0..@item_idx-1) {
push @{\$distances},[];
for my \$j (0..\$i-1) {
push @{\$distances->[-1]}, pair_score( \$datav->[\$item_idx[\$i]],\$d
+atav->[\$item_idx[\$j]] );
}
}
return \$distances;
}

# decide how to score a pair of items
sub pair_score {
my (\$x,\$y) = @_;
my \$score = 0;
for my \$i (0..@\$x-1) {
if(\$x->[\$i] == \$y->[\$i]) {
\$score += \$x->[\$i]; # +1 score for a shared attribute
} else {
#\$score--; # potential penalty for unshared attributes
}
}
return \$score;
}

sub cluster_and_report {
my @item_idx = @_;
my \$zerov = [ map { 0 } @attr ];

my %param = (
data      => distance_matrix(@item_idx),
mask      => [ map { \$zerov } @item_idx ],
weight    => [ map { 1 } @attr ],
transpose => 0,
dist      => "e",
method    => "s",
);

my \$tree = Algorithm::Cluster::treecluster(%param);

for my \$cut_level (1..int(@item_idx)) {
my (\$clusters)  = \$tree->cut(\$cut_level);
#printdumper(\$clusters);
my @cluster_ids = uniq(@\$clusters);
for my \$cluster_id (@cluster_ids) {
my @cluster_item_idx = map { \$item_idx[\$_] } grep(\$clusters->[\$_
+] == \$cluster_id, (0..@item_idx-1));
my @shared_vector    = shared_vector( map { \$datav->[\$_] } @clus
+ter_item_idx);
my @shared_attr      = map { \$attr[\$_] } grep(\$shared_vector[\$_]
+, (0..@shared_vector-1));
printinfo(sprintf("cut level %d cluster,items,attr %d %d %d %s %
+s",
\$cut_level,
\$cluster_id,
int(@cluster_item_idx),
int(@shared_attr),
join(",",@items[@cluster_item_idx]),
join(",",@shared_attr)));
}
}
}

# use the shared attributes as a string to find sets of
# items that share same attribute
sub shared_vector {
my @datav = @_;
my \$shared;
for my \$i (0..@{\$datav[0]}-1) {
if(grep(\$_->[\$i] == 1, @datav) == @datav) {
push @\$shared, 1;
} else {
push @\$shared, 0;
}
}
return @\$shared;
}

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

• Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
• Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
• Read Where should I post X? if you're not absolutely sure you're posting in the right place.
• Posts may use any of the Perl Monks Approved HTML tags:
a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
• You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
 For: Use: & & < < > > [ [ ] ]
• Link using PerlMonks shortcuts! What shortcuts can I use for linking?

Create A New User
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others having an uproarious good time at the Monastery: (7)
As of 2021-01-18 02:10 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
The STEM quote I most wish I'd made is:

Results (177 votes). Check out past polls.

Notices?