PerlMonks
Re: algorithm for 'best subsets'

by BrowserUk (Patriarch)
on Mar 03, 2005 at 03:26 UTC [id://436079]


in reply to algorithm for 'best subsets'

This does all the pairs, triples, quads, quins and sextuplets in 6 seconds, producing 60,000+ lines in the process.

The 615,000 lines from the 2 .. 10 keyword combinations take around 2 1/2 minutes.

My simple combinations generator chews a fair bit of memory though. An iterator would be preferable.

It relies on each item being representable by a single byte--which is okay for up to 255 items if you map the real names to chars.

The output format:

6 two four five nine seventeen eighteen => aekrsy

says that the 6 keywords 'two', 'four', etc. all appeared in each of items a, e, k, r, s, & y.

#! perl -slw
use strict;
use List::Util qw[ shuffle ];
use Data::Dumper;

our $MAX ||= 6;

## A better combinations iterator wouldn't go amiss.
sub Cnr {
    my( $n, @r ) = shift;
    return [] unless $n--;
    for my $x ( 0 .. ($#_ - $n) ) {
        push @r, map{ [ $_[$x], @$_ ] } Cnr( $n, @_[ ($x + 1) .. $#_ ] );
    }
    return @r;
}

## Some keywords
my @keywords = qw[
    zero one two three four five six seven eight nine
    ten eleven twelve thirteen fourteen fifteen sixteen
    seventeen eighteen nineteen
];

## Generate some test data
my %items = map{
    $_ => [ @keywords[ ( shuffle( 0 .. $#keywords ) )[ 0 .. rand @keywords ] ] ]
} 'a' .. 'z';

print "$_ => @{ $items{ $_ } }" for sort keys %items;

## Build a byte-mapped index: keywords to items
my %index = map{ $_ => chr(0) x 26 } @keywords;

for my $item ( keys %items ) {
    substr $index{ $_ }, ord($item) - ord('a'), 1, $item
        for @{ $items{ $item } };
}

my %combs;

## For combinations of 2, 3, 4 .. $MAX keywords.
for my $n ( 2 .. $MAX ) {
    print "n-ary:$n";

    ## Generate the pairs, triples etc.
    for my $comb ( Cnr $n, @keywords ) {
        ## AND the byte maps together
        $combs{ "$n @$comb" } = $index{ $comb->[ 0 ] } & $index{ $comb->[ 1 ] };
        $combs{ "$n @$comb" } &= $index{ $comb->[ $_ ] } for 2 .. $n-1;

        ## and strip the nulls.
        $combs{ "$n @$comb" } =~ tr[\0][]d;

        ## Delete it, if no items with these keywords were found.
        delete $combs{ "$n @$comb" } unless length $combs{ "$n @$comb" };
    }
}

## Display or file the results.
print "$_ => $combs{ $_ }" for sort{ local $^W; 0+$a <=> 0+$b } keys %combs;

__END__
a => six seven seventeen two sixteen nineteen three eighteen fifteen twelve eleven fourteen thirteen eight nine four one five ten zero
b => seven ten five two zero six one nine three seventeen
c => eleven four eighteen twelve fifteen seventeen seven six fourteen nine zero sixteen three one
d => fifteen ten sixteen seventeen nineteen five two twelve four zero thirteen eighteen eleven
e => zero eight seventeen eleven fourteen ten fifteen four eighteen nineteen three two seven one five nine
f => ten five nine
g => fifteen nineteen seventeen eight ten seven
h => four eighteen one fourteen fifteen three seven zero twelve thirteen seventeen nineteen nine five
i => one two
j => seven five thirteen twelve one eighteen nineteen three
k => ten two fifteen six nineteen four seven nine twelve one eleven eighteen seventeen five
l => one five eighteen fourteen thirteen nineteen nine six four eleven zero eight seven twelve
m => nineteen thirteen seven four ten three nine zero two fourteen eight five sixteen six twelve
n => eleven eighteen nineteen ten three twelve fifteen one fourteen eight thirteen six seven
o => eleven eight four eighteen seventeen sixteen nine twelve nineteen zero thirteen three
p => zero ten twelve eleven eighteen two sixteen
q => ten fifteen sixteen eighteen eleven thirteen zero seven nineteen three six seventeen fourteen one eight two four
r => zero nineteen fifteen thirteen one twelve three seventeen six four eight sixteen seven eleven nine five eighteen ten two fourteen
s => ten two zero fifteen five nine nineteen three sixteen one eighteen seventeen eight fourteen eleven seven six thirteen twelve four
t => seventeen four one ten zero nine fourteen
u => fifteen two
v => seven
w => fifteen eighteen seven nineteen
x => eight two
y => thirteen twelve fourteen fifteen seventeen sixteen three two five ten zero nine one four eighteen nineteen
z => two nine seven sixteen one six zero fourteen seventeen four nineteen twelve fifteen eight eighteen
n-ary:2
n-ary:3
n-ary:4
n-ary:5
n-ary:6
2 six seventeen => abckqrsz
2 six nine => abcklmrsz
2 twelve nineteen => adhjklmnorsyz
...
3 zero thirteen fifteen => adhqrsy
3 eleven fourteen eighteen => acelnqrs
3 one ten eleven => aeknqrs
...
4 two three eight eleven => aeqrs
4 four seven ten fifteen => aekqrs
4 zero ten eleven sixteen => adpqrs
...
5 zero two five seven fifteen => aers
5 three six eight fourteen fifteen => anqrs
5 one ten eleven fourteen eighteen => aenqrs
...
6 two four five nine seventeen eighteen => aekrsy
6 one five six nine twelve eighteen => aklrs
6 one four six seven sixteen eighteen => acqrsz
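[Editor's sketch, not part of the original post.] The byte-map-AND trick above translates naturally to integer bitmasks. A minimal Python rendering on made-up toy data--bit i of a keyword's mask is set when item chr(ord('a')+i) carries that keyword, and intersecting keywords is just ANDing their masks:

```python
from itertools import combinations
from string import ascii_lowercase

# Toy data: items (documents) mapped to their keyword lists.
items = {
    "a": ["two", "four", "nine"],
    "b": ["two", "nine"],
    "c": ["four", "nine"],
    "d": ["two", "four", "nine"],
}

# Index each keyword as an integer bitmask over the items:
# bit i is set when item chr(ord('a') + i) contains the keyword.
index = {}
for item, kws in items.items():
    bit = 1 << (ord(item) - ord("a"))
    for kw in kws:
        index[kw] = index.get(kw, 0) | bit

def items_for(comb):
    """AND the bitmasks together; decode set bits back to item names."""
    mask = index[comb[0]]
    for kw in comb[1:]:
        mask &= index[kw]
    return "".join(c for i, c in enumerate(ascii_lowercase) if mask >> i & 1)

keywords = sorted(index)
for n in (2, 3):
    for comb in combinations(keywords, n):
        hit = items_for(comb)
        if hit:
            print(n, " ".join(comb), "=>", hit)
```

Unlike the byte-string version, integer masks are not limited to 255 items--Python integers grow as needed.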

Examine what is said, not who speaks.
Silence betokens consent.
Love the truth but pardon error.

Replies are listed 'Best First'.
Re^2: algorithm for 'best subsets'
by Roy Johnson (Monsignor) on Mar 03, 2005 at 18:11 UTC
    Shouldn't you have MIN instead of MAX? You are, after all, interested in the things with the most in common.

    Jumping on tall_man's idea to use Bit::Vector::Overload (and shamelessly stealing your data generator), here's a new solution. It's reasonably quick (about 15x faster than yours on my slow machine, though a chunk of the difference is printing time) to generate all the tuples and spit them out, nicely ordered by cardinality.

    There is much less output, because only tuples that actually represent the intersection of some pair of elements are included. When such a tuple is found, then the rest of the elements are checked to see if they should be included with it, so that the list for the tuple is complete.
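[Editor's sketch, not part of the original reply.] The pruning idea described above can be shown in a few lines of Python (toy data; plain frozensets stand in for Bit::Vector): only keyword sets that arise as the intersection of some pair of items become candidate tuples, and a final sweep over all items completes each tuple's membership list:

```python
from itertools import combinations

# Toy item -> keyword-set data (stand-in for the generated test data).
items = {
    "a": {"two", "four", "nine", "six"},
    "b": {"two", "nine", "six"},
    "c": {"four", "nine"},
    "d": {"two", "four", "nine"},
}

# Only tuples that arise as the intersection of some PAIR of items
# are candidates; that prunes the full C(K, n) enumeration.
tuples = {}
for x, y in combinations(sorted(items), 2):
    common = frozenset(items[x] & items[y])
    if len(common) >= 2:
        # Sweep all items to complete the membership list for this tuple.
        members = "".join(i for i in sorted(items) if common <= items[i])
        tuples[common] = members

# Largest tuples (most keywords in common) first.
for kws, members in sorted(tuples.items(), key=lambda t: -len(t[0])):
    print(len(kws), " ".join(sorted(kws)), "=>", members)
```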


    Caution: Contents may have been coded under pressure.
Re^2: algorithm for 'best subsets'
by halley (Prior) on Mar 03, 2005 at 14:41 UTC
    I appreciate the technique, but the keywords are selected from the English language (and proper nouns), and the number of keyworded items could be in the tens of thousands, so I could not use any technique that limited the epsilon to 255~256 symbols.

    I'll be adding more information and test cases shortly.

    --
    [ e d @ h a l l e y . c c ]

      Oh, eww. Many of the solutions so far will completely fall apart with a distinct keyword set that large.

      This problem just exploded. Really. Can you give an estimate on:

      • K - the number of distinct keywords,
      • I - the number of entries in the %items hash,
      • L - the average number of keywords per %items entry, and
      • N - the number of words taken at a time
      I suspect that many of these solutions will show either time or memory complexity of O(K!/(N!(K-N)!)) - I know that mine has the potential to show memory complexity like that (though the %totals hash could be tied to a massive dbm file), and at least one of the other solutions has time complexity that looks like that.

      Of course, if L is sufficiently large, you're screwed too, since I don't think there's a way to not spend at least O(L!/(N!(L-N)!)) time.
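      [Editor's sketch, not part of the original reply.] Those factorial expressions are just binomial coefficients, and a quick Python check shows how fast C(K, N) blows up (the K and N values here are purely illustrative):

```python
from math import comb

# K!/(N!(K-N)!) is just C(K, N): the number of N-keyword combinations.
# Even modest K makes exhaustive enumeration hopeless.
for K in (20, 100, 10_000):
    for N in (2, 6):
        print(f"C({K}, {N}) = {comb(K, N):,}")
```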

      -- @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/

      As long as you can represent each keyword by an integer--say, its line number in a dictionary file (unsorted, so that you can add new words to the end without remapping everything) or similar--then representing--permanently, so that you do not have to regenerate the mappings each time--the items as bitvectors and ANDing them is going to be the quickest way to do the intersection.

      In effect, every document (item) would be indexed by a string that is a bitvector representing the (significant) words it contains.

      If you are excluding the usual stop words--those fewer than 3 chars, 'then', 'those', 'some', etc.--then the length of your bitvectors should be reasonable.

      The problem comes with the combinatorics.

      Building an index of keywords to items--so that when you get a set of search terms you can reduce the combinatorics to only those items containing the search terms--will reduce the overall problem.

      You could try building a secondary index that maps pairs of keywords to items that contain them. That would rapidly reduce the size of the problem by limiting the combinations that need to be tested. But with the 2**(words in the English language + Proper nouns) you would need a huge database.
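      [Editor's sketch, not part of the original reply.] A minimal Python sketch of such a secondary index (toy data): building it from each item's own keyword list means only pairs that actually co-occur ever get an entry, which is what keeps it far smaller than the 2**K worst case:

```python
from itertools import combinations
from collections import defaultdict

# Toy item -> keyword data.
items = {
    "a": ["two", "four", "nine"],
    "b": ["two", "nine"],
    "d": ["two", "four", "nine"],
}

# Secondary index: keyword PAIR -> set of items containing both.
# Only pairs that actually co-occur in some item get an entry.
pair_index = defaultdict(set)
for item, kws in items.items():
    for pair in combinations(sorted(kws), 2):
        pair_index[pair].add(item)

print(sorted(pair_index[("four", "nine")]))  # → ['a', 'd']
```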

      If the problem were the usual "Given these keywords, find the (topN) documents that contain the most of them", it wouldn't be so bad, but you appear to want to generate some sort of "best keyword sets", which I find confusing.

      I hate posts that ask me "What is it that you are ultimately trying to achieve?", but given the sheer scale of the numbers you are talking about, I think this may be a good time to ask such a question. Sorry :)


      Examine what is said, not who speaks.
      Silence betokens consent.
      Love the truth but pardon error.
