Re^2: algorithm for 'best subsets'

by halley (Prior)
on Mar 03, 2005 at 14:41 UTC ( [id://436223] )


in reply to Re: algorithm for 'best subsets'
in thread algorithm for 'best subsets'

I appreciate the technique, but the keywords are selected from the English language (and proper nouns), and the number of keyworded items could be in the tens of thousands, so I could not use any technique that limits the alphabet to 255-256 symbols.

I'll be adding more information and test cases shortly.

--
[ e d @ h a l l e y . c c ]

Replies are listed 'Best First'.
Re^3: algorithm for 'best subsets'
by fizbin (Chaplain) on Mar 03, 2005 at 15:30 UTC
    Oh, eww. Many of the solutions so far will completely fall apart with a distinct keyword set that large.

    This problem just exploded. Really. Can you give estimates for:

    • K - the number of distinct keywords,
    • I - the number of entries in the %items hash,
    • L - the average number of keywords per %items entry, and
    • N - the number of words taken at a time
    I suspect that many of these solutions will show either time or memory complexity of O(K!/(N!(K-N)!)) - I know that mine has the potential to show memory complexity like that (though the %totals hash could be tied to a massive dbm file), and at least one of the other solutions has time complexity that looks like that.

    Of course, if L is sufficiently large, you're screwed too, since I don't think there's a way to avoid spending at least O(L!/(N!(L-N)!)) time.
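
    As a rough illustration (the figures below are invented, not taken from this thread), here is how quickly C(K, N) = K!/(N!(K-N)!) grows once K reaches the tens of thousands:

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Number of N-keyword subsets drawn from K distinct keywords:
        # C(K, N) = K! / (N! * (K - N)!)
        sub choose {
            my ($k, $n) = @_;
            my $c = 1;
            # multiplicative form avoids computing huge factorials
            $c = $c * ($k - $_ + 1) / $_ for 1 .. $n;
            return $c;
        }

        # Hypothetical sizes: 20,000 distinct keywords, taken 3 at a time.
        printf "C(%d, %d) = %g\n", 20_000, 3, choose(20_000, 3);
        # prints C(20000, 3) = 1.33313e+12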

    -- @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/
Re^3: algorithm for 'best subsets'
by BrowserUk (Patriarch) on Mar 03, 2005 at 16:23 UTC

    As long as you can represent the keywords by an integer--say, the line number in a dictionary file (unsorted, so that you can add new words to the end without remapping everything) or similar--then representing the items as bitvectors (permanently, so that you do not have to regenerate the mappings each time) and ANDing them is going to be the quickest way to do the intersection.

    In effect, every document (item) would be indexed by a string that is a bitvector representing the (significant) words it contains.

    If you are excluding the usual stop words--those less than 3 chars, 'then', 'those', 'some', etc.--then the length of your bitvectors should be reasonable.
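
    A minimal sketch of the bitvector approach (not code from this thread; the %items data are invented, and the dictionary file is faked with an in-memory array):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Invented example data; the real %items would be far larger.
        my %items = (
            alpha => [qw( red green blue )],
            beta  => [qw( red blue yellow )],
        );

        # Stand-in for the unsorted dictionary file: each new word is
        # appended, so existing words keep their bit positions forever.
        my (%word2bit, @dictionary);
        sub bit_for {
            my $word = shift;
            unless (exists $word2bit{$word}) {
                push @dictionary, $word;
                $word2bit{$word} = $#dictionary;
            }
            return $word2bit{$word};
        }

        # item name => bitvector string (this is what you would store).
        my %vec;
        for my $item (keys %items) {
            my $v = '';
            vec($v, bit_for($_), 1) = 1 for @{ $items{$item} };
            $vec{$item} = $v;
        }

        # Bitwise AND of the strings intersects the keyword sets.
        my $common = $vec{alpha} & $vec{beta};
        my $count  = unpack '%32b*', $common;   # population count
        print "alpha and beta share $count keyword(s)\n";   # red, blue => 2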

    The problem comes with the combinatorics.

    Building an index of keywords to items, so that when you get a set of search terms you can restrict the combinatorics to only those items containing the search terms, will reduce the overall problem.
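
    A rough sketch of such a keyword-to-items index (invented data, and only one obvious way to lay it out):

        #!/usr/bin/perl
        use strict;
        use warnings;

        my %items = (
            alpha => [qw( red green blue )],
            beta  => [qw( red blue yellow )],
            gamma => [qw( green yellow )],
        );

        # keyword => [ names of items containing it ]
        my %index;
        for my $item (keys %items) {
            push @{ $index{$_} }, $item for @{ $items{$item} };
        }

        # Only items containing at least one search term need to take
        # part in the combinatorics at all.
        my @terms = qw( red yellow );
        my %candidates =
            map { $_ => 1 } map { @{ $index{$_} || [] } } @terms;
        print "candidates: @{[ sort keys %candidates ]}\n";   # alpha beta gamma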

    You could try building a secondary index that maps pairs of keywords to the items that contain them. That would rapidly reduce the size of the problem by limiting the combinations that need to be tested. But with 2**(words in the English language + proper nouns) possibilities, you would need a huge database.
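
    A sketch of what that pair index might look like (again invented data; note how quickly the number of distinct pair keys outgrows the vocabulary itself):

        #!/usr/bin/perl
        use strict;
        use warnings;

        my %items = (
            alpha => [qw( red green blue )],
            beta  => [qw( red blue yellow )],
        );

        # "word1\0word2" (sorted pair) => [ items containing both words ]
        my %pair_index;
        for my $item (keys %items) {
            my @w = sort @{ $items{$item} };
            for my $i (0 .. $#w - 1) {
                for my $j ($i + 1 .. $#w) {
                    push @{ $pair_index{"$w[$i]\0$w[$j]"} }, $item;
                }
            }
        }

        my @both = sort @{ $pair_index{"blue\0red"} || [] };
        print "items containing both 'blue' and 'red': @both\n";   # alpha beta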

    If the problem was the usual "Given these keywords, find the (topN) documents that contain the most of them", it wouldn't be so bad, but you appear to want to generate some sort of 'best keyword sets', which I find confusing.

    I hate posts that ask me "What is it that you are ultimately trying to achieve?", but given the sheer scale of the numbers you are talking about, I think this may be a good time to ask such a question. Sorry :)


    Examine what is said, not who speaks.
    Silence betokens consent.
    Love the truth but pardon error.
