I have a data structure which is similar to the following. The actual structure is not important, just that we're dealing with a number of "sets" of keywords, some of which are shared between items, and many of which are not.
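For instance (a sketch of the shape only; the item names and keywords here are illustrative placeholders, since the real data doesn't matter):

    my %items = (
        item1 => [ qw( A B C D ) ],   # each value is that item's list of keywords
        item2 => [ qw( B C E ) ],
        item3 => [ qw( A B E F ) ],
        # ... many more items
    );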
Each item's keywords are unique. The arrays could just as easily be keys of hashes; in this case they just happen to be stored in arrays.

I can pretty easily scan the set of items for those with the most keywords, or scan for the keywords which appear in the most items. What I would like to do is figure out which pairs, triples, or n-ples of keywords are shared by the most items.

For example, if keyword A is found in ten items and keyword B is found in twenty items, but only one item has both A and B, then the pair is a weak candidate. If nine items have both A and B, then it's a great candidate.

In Google terms, think of it this way: given the database of websites, what four-word query would return the most result links? What would be the second-best four-word query? The third-best? And so on...

Ideas? Does this align with some obscure (or obvious) algorithm I didn't recognize? (The obvious brute-force approach is sketched below.)

Update: A test-case-generator function has been added below. Use this data if you'd like to do some benchmarking on your own.

Update 2: A sample output from an early timeline scanner has been added below.
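A brute-force starting point for the pairs case (a sketch only, assuming the %items shape above) is to count, for every pair of keywords, how many items contain both:

    my %pair_count;
    for my $item ( keys %items ) {
        # sort so each pair has one canonical order (A+B and B+A count together)
        my @kw = sort @{ $items{$item} };
        for my $i ( 0 .. $#kw - 1 ) {
            for my $j ( $i + 1 .. $#kw ) {
                $pair_count{"$kw[$i] $kw[$j]"}++;
            }
        }
    }

    # rank pairs by how many items share them
    for my $pair ( sort { $pair_count{$b} <=> $pair_count{$a} } keys %pair_count ) {
        print "$pair => $pair_count{$pair} items\n";
    }

Even for pairs this is O(items x k^2) in the per-item keyword count k, and enumerating all n-ples explicitly blows up combinatorially, which is exactly why I'm hoping for a smarter algorithm.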
-- In reply to algorithm for 'best subsets' by halley