Here is my entry, making use of some nice operations from
Bit::Vector::Overload. It still has some rough spots in combining the final results, but it narrows things down very quickly. No combinatoric generators are used.
Update: Revised to collect the groups much better. I believe it will now do what Roy Johnson suggested. I didn't change any hashes to arrays, though.
Timing on my machine for a 676-item test case generated by the benchmark program halley did:
4.830u 0.000s 0:08.29 58.2% 0+0k 0+0io 420pf+0w
Update2: I also tried a 17,576 item example (with 'kaaa' ... 'kzzz' and 'iaaa' .. 'izzz'). It ran for one hour to find all groups from 2 up to 4 (the maximum available in this case). The timing is consistent with O(I^2 * log K), where I is the item count and K is the keyword count.
Update3: Inner loop optimization -- better ways to test for empty sets (is_empty) and count bits in sets (Norm). Went from one hour to 54 minutes on the biggest case.
#!/usr/bin/perl
use strict;
use warnings;
use Bit::Vector::Overload;
my %items = (
a => [ qw/one six/ ],
b => [ qw/two three five/ ],
c => [ qw/one two five/ ],
d => [ qw/one seven five/ ],
e => [ qw/one two five/ ],
f => [ qw/one two four seven/ ],
g => [ qw/one two five/ ],
h => [ qw/one two three five/ ],
);
my $icount = keys %items;
# Form a mapping from items to bit positions.
# Collect a list of bitmaps for combination work.
my $ix = 0;
my %ipos;
my @ilst;
for my $itm ( sort keys %items ) {
$ipos{$itm} = $ix;
my $set1 = new Bit::Vector($icount + 1);
$set1->Bit_On($ix);
push @ilst, $set1;
++$ix;
}
my %revipos = reverse(%ipos);
# Form a mapping from keywords to bit positions.
my $scount = 0;
my %kpos;
for my $itm ( sort keys %items ) {
for my $keyw ( @{ $items{ $itm }} ) {
$kpos{$keyw} = $scount++ if (!exists $kpos{$keyw});
}
}
# Also form a reverse index for later printing.
my %revkpos = reverse(%kpos);
# Form bit vectors with ones in the keyword positions,
# one for every item.
my %keyword_vecs;
my @lst1;
for my $itm ( sort keys %items ) {
my $set0 = new Bit::Vector($scount + 1);
$keyword_vecs{$itm} = $set0;
for my $keyw ( @{ $items{ $itm }} ) {
$set0->Bit_On($kpos{$keyw});
}
# hold both sets - items and keywords together.
push @lst1,[$ilst[$ipos{$itm}], $set0];
}
# Must have at least matching pairs.
my @lst2;
my %same_merger; # want to merge combos with common intersections.
my $i;
my $j;
my $imax = @lst1;
for ($i = 0; $i < $imax; ++$i) {
for ($j = $i+1; $j < $imax; ++$j) {
my $kcombo = $lst1[$i]->[1] & $lst1[$j]->[1];
next if $kcombo->is_empty();
next if $kcombo->Norm() < 2;
my $k = "$kcombo";
if (exists $same_merger{$k}) {
$same_merger{$k}->[0] |= $lst1[$i]->[0];
$same_merger{$k}->[0] |= $lst1[$j]->[0];
} else {
$same_merger{$k} = [ ($lst1[$i]->[0] | $lst1[$j]->[0]), $kcom
+bo ];
}
}
}
for (keys %same_merger) {
my $kref = $same_merger{$_};
my $icombo = $kref->[0];
my $kcombo = $kref->[1];
my @inames = @revipos{ $icombo->Index_List_Read() };
my @knames = @revkpos{ $kcombo->Index_List_Read() };
# Result could be externally sorted, or sort the lst2 array and the
+n print.
print scalar(@inames)," : @inames combo is @knames","\n";
push @lst2,[$icombo, $kcombo];
}