Re^2: algorithm for 'best subsets'

by halley (Prior)
on Mar 03, 2005 at 14:41 UTC ( [id://436223] )


in reply to Re: algorithm for 'best subsets'
in thread algorithm for 'best subsets'

I appreciate the technique, but the keywords are selected from the English language (and proper nouns), and the number of keyworded items could be in the tens of thousands, so I could not use any technique that limits the alphabet to 255-256 symbols.

I'll be adding more information and test cases shortly.

--
[ e d @ h a l l e y . c c ]

Replies are listed 'Best First'.
Re^3: algorithm for 'best subsets'
by fizbin (Chaplain) on Mar 03, 2005 at 15:30 UTC
    Oh, eww. Many of the solutions so far will completely fall apart with a distinct keyword set that large.

    This problem just exploded. Really. Can you give estimates for:

    • K - the number of distinct keywords,
    • I - the number of entries in the %items hash,
    • L - the average number of keywords per %items entry, and
    • N - the number of words taken at a time
    I suspect that many of these solutions will show either time or memory complexity of O(K!/(N!(K-N)!)) - I know that mine has the potential to show memory complexity like that (though the %totals hash could be tied to a massive dbm file), and at least one of the other solutions has time complexity that looks like that.

    Of course, if L is sufficiently large, you're screwed too, since I don't think there's a way to avoid spending at least O(L!/(N!(L-N)!)) time.
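
    As a rough illustration (the figures below are invented, not taken from this thread), here is how quickly C(K, N) = K!/(N!(K-N)!) grows once K reaches the tens of thousands:

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Number of N-keyword subsets drawn from K distinct keywords:
        # C(K, N) = K! / (N! * (K - N)!)
        sub choose {
            my ($k, $n) = @_;
            my $c = 1;
            # multiplicative form avoids computing huge factorials
            $c = $c * ($k - $_ + 1) / $_ for 1 .. $n;
            return $c;
        }

        # Hypothetical sizes: 20,000 distinct keywords, taken 3 at a time.
        printf "C(%d, %d) = %g\n", 20_000, 3, choose(20_000, 3);
        # prints C(20000, 3) = 1.33313e+12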

    -- @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/
Re^3: algorithm for 'best subsets'
by BrowserUk (Patriarch) on Mar 03, 2005 at 16:23 UTC

    As long as you can represent the keywords by an integer--say, the line number in a dictionary file (unsorted, so that you can add new words to the end without remapping everything) or similar--then representing the items as bitvectors (permanently, so that you do not have to regenerate the mappings each time) and ANDing them is going to be the quickest way to do the intersection.

    In effect, every document (item) would be indexed by a string that is a bitvector representing the (significant) words it contains.

    If you are excluding the usual stop words--those less than 3 chars, 'then', 'those', 'some', etc.--then the length of your bitvectors should be reasonable.
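
    A minimal sketch of the bitvector approach (not code from this thread; the %items data are invented, and the dictionary file is faked with an in-memory array):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Invented example data; the real %items would be far larger.
        my %items = (
            alpha => [qw( red green blue )],
            beta  => [qw( red blue yellow )],
        );

        # Stand-in for the unsorted dictionary file: each new word is
        # appended, so existing words keep their bit positions forever.
        my (%word2bit, @dictionary);
        sub bit_for {
            my $word = shift;
            unless (exists $word2bit{$word}) {
                push @dictionary, $word;
                $word2bit{$word} = $#dictionary;
            }
            return $word2bit{$word};
        }

        # item name => bitvector string (this is what you would store).
        my %vec;
        for my $item (keys %items) {
            my $v = '';
            vec($v, bit_for($_), 1) = 1 for @{ $items{$item} };
            $vec{$item} = $v;
        }

        # Bitwise AND of the strings intersects the keyword sets.
        my $common = $vec{alpha} & $vec{beta};
        my $count  = unpack '%32b*', $common;   # population count
        print "alpha and beta share $count keyword(s)\n";   # red, blue => 2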

    The problem comes with the combinatorics.

    Building an index of keywords to items, so that when you get a set of search terms you can restrict the combinatorics to only those items containing the search terms, will reduce the overall problem.
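
    A rough sketch of such a keyword-to-items index (invented data, and only one obvious way to lay it out):

        #!/usr/bin/perl
        use strict;
        use warnings;

        my %items = (
            alpha => [qw( red green blue )],
            beta  => [qw( red blue yellow )],
            gamma => [qw( green yellow )],
        );

        # keyword => [ names of items containing it ]
        my %index;
        for my $item (keys %items) {
            push @{ $index{$_} }, $item for @{ $items{$item} };
        }

        # Only items containing at least one search term need to take
        # part in the combinatorics at all.
        my @terms = qw( red yellow );
        my %candidates =
            map { $_ => 1 } map { @{ $index{$_} || [] } } @terms;
        print "candidates: @{[ sort keys %candidates ]}\n";   # alpha beta gamma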

    You could try building a secondary index that maps pairs of keywords to the items that contain them. That would rapidly reduce the size of the problem by limiting the combinations that need to be tested. But with 2**(words in the English language + proper nouns) possibilities, you would need a huge database.
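
    A sketch of what that pair index might look like (again invented data; note how quickly the number of distinct pair keys outgrows the vocabulary itself):

        #!/usr/bin/perl
        use strict;
        use warnings;

        my %items = (
            alpha => [qw( red green blue )],
            beta  => [qw( red blue yellow )],
        );

        # "word1\0word2" (sorted pair) => [ items containing both words ]
        my %pair_index;
        for my $item (keys %items) {
            my @w = sort @{ $items{$item} };
            for my $i (0 .. $#w - 1) {
                for my $j ($i + 1 .. $#w) {
                    push @{ $pair_index{"$w[$i]\0$w[$j]"} }, $item;
                }
            }
        }

        my @both = sort @{ $pair_index{"blue\0red"} || [] };
        print "items containing both 'blue' and 'red': @both\n";   # alpha beta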

    If the problem was the usual "Given these keywords, find the (topN) documents that contain the most of them", it wouldn't be so bad, but you appear to want to generate some sort of 'best keyword sets', which I find confusing.

    I hate posts that ask me "What is it that you are ultimately trying to achieve?", but given the sheer scale of the numbers you are talking about, I think this may be a good time to ask such a question. Sorry :)


    Examine what is said, not who speaks.
    Silence betokens consent.
    Love the truth but pardon error.
