comment on

As long as you can represent the keywords by an integer by saying line number in a dictionary file (unsorted so that you can add new words to the end without remapping everything) or similar, then representing--permenantly, so that you do not have to regenerate the mappings each time--the items as bitvectors and ANDing them is going to be the quickest way to do the intersection.

In effect, every document (item) would be indexed by a string that is a bitvector representing the (significant) words it contains.

If you are excluding the usual stop words--those less than 3 chars, 'then', 'those', 'some' etc., then the length of your bitvectors should be reasonable.

The problem comes with the combinatorics.

By building an index of keywords to items, so that when you get a set of search terms, you can reduce the combinatorics to only those items containing the search terms will reduce the overall problem.

You could try building a secondary index that maps pairs of keywords to items that contain them. That would rapidly reduce the size of the problem by limiting the combinations that need to be tested. But with the 2**(words in the English language + Proper nouns) you would need a huge database.

If the problem was the usual "Given these keywords, find the (topN) documents that contain the most of them", it wouldn't be so bad, but you appear to want to generate the some sort of "best keyword sets", which I find confusing?

I hate posts that ask me "What is it that you ultimately trying to achieve?", but given the shear scale of the numbers you are talking about, I think this may be a good time to ask such a question. Sorry :)

Examine what is said, not who speaks.

Silence betokens consent.

Love the truth but pardon error.

In reply to Re^3: algorithm for 'best subsets' by BrowserUk
in thread algorithm for 'best subsets' by halley

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


Do you know where your variables are?
	PerlMonks