halley has asked for the wisdom of the Perl Monks concerning the following question:
%items = (
    z => [ qw/one six/ ],
    y => [ qw/two three five/ ],
    x => [ qw/one two five/ ],
    ...
);
Each item's keywords are unique. The arrays could just as easily be keys of hashes, but in this case, they just happen to be stored in arrays.
I can pretty easily scan the set of items for those with the most keywords, or scan for the set of keywords which appear in the most items.
What I would like to do is figure out which pairs of keywords, or triples of keywords, or n-tuples of keywords are shared by the most items. For example, if keyword A is found in ten items, and keyword B is found in twenty items, but only one item has both A and B, then it's a weak candidate. If nine items have both A and B, then it's a great candidate.
In Google terms, think of it this way: from the database of websites, what would be the best four-word query to return the most result links? And then the second-best four-word query? And the third-best four-word query? And so on...
Ideas? Does this align with some obscure or obvious algorithm I couldn't recognize?
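To make the pair case concrete, here is a minimal brute-force sketch (illustration only, using the toy %items above) that tallies how many items contain each pair of keywords:

use strict;
use warnings;

my %items = (
    z => [qw/one six/],
    y => [qw/two three five/],
    x => [qw/one two five/],
);

# For every item, bump a counter for each pair of keywords it contains.
my %pair_count;
for my $item (keys %items) {
    my @kw = sort @{ $items{$item} };
    for my $i (0 .. $#kw - 1) {
        for my $j ($i + 1 .. $#kw) {
            $pair_count{"$kw[$i] $kw[$j]"}++;
        }
    }
}

# Pairs shared by the most items come out first.
for my $pair (sort { $pair_count{$b} <=> $pair_count{$a} } keys %pair_count) {
    print "$pair_count{$pair}\t$pair\n";
}

Extending this beyond pairs is exactly where the combinatorics bite, which is what the replies below wrestle with.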
Update: A test-case-generator function has been added below. Use this data if you'd like to do some benchmarking on your own.
Update 2: A sample output from an early timeline scanner has been added below.
--
[ e d @ h a l l e y . c c ]
Replies are listed 'Best First'.
Re: algorithm for 'best subsets'
by kvale (Monsignor) on Mar 03, 2005 at 01:14 UTC
As for code to implement this measure, it is simple to use HoHs:
Update: Note that this isn't a full correlation calculation, but it implements the OP's desired numbers for a 2-way measure. As tall_man pointed out, I blew it :) Here is the corrected code:
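The corrected code is behind the 'Read more' link; a rough sketch of the hash-of-hashes tally kvale describes (my illustration, not his code, using the root node's toy data) might look like:

use strict;
use warnings;

my %items = (   # toy data from the root node
    z => [qw/one six/],
    y => [qw/two three five/],
    x => [qw/one two five/],
);

# $share{$a}{$b} = number of items containing both keyword $a and keyword $b.
my %share;
for my $item (keys %items) {
    my @kw = @{ $items{$item} };
    for my $a (@kw) {
        for my $b (@kw) {
            $share{$a}{$b}++ unless $a eq $b;
        }
    }
}

# e.g. $share{two}{five} is 2 for the toy data: items x and y contain both.
print "$share{two}{five}\n";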
-Mark
by tall_man (Parson) on Mar 03, 2005 at 01:19 UTC
Re: algorithm for 'best subsets'
by BrowserUk (Patriarch) on Mar 03, 2005 at 03:26 UTC
This does all the pairs, triples, quads, quins and sextuplets in 6 seconds, producing 60,000+ lines in the process. The 615,000 lines for all the 2 .. 10 keyword combinations take around 2 1/2 minutes. My simple combinations generator chews a fair bit of memory, though; an iterator would be preferable. It relies on each item being representable by a single byte--which is okay up to 255 items if you map the real names to chars. The output format says that the 6 keywords 'two', 'four', etc. all appeared in each of items a, e, k, r, s, and y. Read more... (6 kB)
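BrowserUk's actual generator and byte-per-item packing are behind the 'Read more' link; the sketch below is mine, using the root node's toy data and plain hash sets instead of his byte strings, and only shows the core step of intersecting the item sets of one keyword combination:

use strict;
use warnings;

my %items = (   # toy data from the root node
    z => [qw/one six/],
    y => [qw/two three five/],
    x => [qw/one two five/],
);

# Invert: for each keyword, the set of items that contain it.
my %by_keyword;
for my $item (keys %items) {
    $by_keyword{$_}{$item} = 1 for @{ $items{$item} };
}

# Intersect the item sets of a given combination of keywords.
sub items_sharing {
    my @kw   = @_;
    my %seen = %{ $by_keyword{ shift @kw } || {} };
    for my $k (@kw) {
        %seen = map { $_ => 1 } grep { $by_keyword{$k}{$_} } keys %seen;
    }
    return sort keys %seen;
}

print join(',', items_sharing(qw/two five/)), "\n";   # x,y for the toy data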
Examine what is said, not who speaks.
Silence betokens consent.
Love the truth but pardon error.
by Roy Johnson (Monsignor) on Mar 03, 2005 at 18:11 UTC
Jumping on tall_man's idea to use Bit::Vector::Overload (and shamelessly stealing your data generator), here's a new solution. It's reasonably quick (about 15x faster than yours on my slow machine, though a chunk of the difference is printing time) to generate all the tuples and spit them out, nicely ordered by cardinality. There is much less output, because only tuples that actually represent the intersection of some pair of elements are included. When such a tuple is found, the rest of the elements are checked to see if they should be included with it, so that the list for the tuple is complete. Read more... (3 kB)
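Roy's code is behind the 'Read more' link; as a rough illustration of the Bit::Vector::Overload idea -- one vector of keyword bits per item, pairwise AND to get the shared tuple -- here is a sketch of mine, not his solution, on the root node's toy data:

use strict;
use warnings;
use Bit::Vector::Overload;

my %items = (   # toy data from the root node
    z => [qw/one six/],
    y => [qw/two three five/],
    x => [qw/one two five/],
);

# Give every keyword a bit position.
my @keywords = do { my %k; $k{$_} = 1 for map { @$_ } values %items; sort keys %k };
my %kpos     = map { $keywords[$_] => $_ } 0 .. $#keywords;

# One bit vector per item, with a bit set for each keyword the item contains.
my %vec;
for my $item (keys %items) {
    my $v = Bit::Vector->new(scalar @keywords);
    $v->Bit_On($kpos{$_}) for @{ $items{$item} };
    $vec{$item} = $v;
}

# The keywords shared by a pair of items are just the AND of their vectors.
my $shared = $vec{x} & $vec{y};
my @tuple  = grep { $shared->bit_test($kpos{$_}) } @keywords;
print "x and y share: @tuple\n";   # five two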
Caution: Contents may have been coded under pressure.
by halley (Prior) on Mar 03, 2005 at 14:41 UTC
I'll be adding more information and test cases shortly.
by fizbin (Chaplain) on Mar 03, 2005 at 15:30 UTC
This problem just exploded. Really. Can you give an estimate on:
Of course, if L is sufficiently large, you're screwed too, since I don't think there's a way to avoid spending at least O(L!/(N!(L-N)!)) time -- that is, on the order of 'L choose N'.
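A throwaway sketch, just to get a feel for how fast that count grows; the 1000-keyword figure is an arbitrary example, not from the thread:

# Number of N-keyword combinations among L distinct keywords: L choose N.
sub choose {
    my ($l, $n) = @_;
    my $c = 1;
    $c = $c * ($l - $_ + 1) / $_ for 1 .. $n;
    return $c;
}
printf "%.0f four-keyword combinations for 1000 keywords\n", choose(1000, 4);  # about 4.1e10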
by BrowserUk (Patriarch) on Mar 03, 2005 at 16:23 UTC
As long as you can represent each keyword by an integer, say its line number in a dictionary file (unsorted, so that you can add new words to the end without remapping everything) or similar, then representing the items as bitvectors--permanently, so that you do not have to regenerate the mappings each time--and ANDing them is going to be the quickest way to do the intersection. In effect, every document (item) would be indexed by a string that is a bitvector representing the (significant) words it contains. If you exclude the usual stop words--those less than 3 chars, plus 'then', 'those', 'some', etc.--then the length of your bitvectors should be reasonable.
The problem comes with the combinatorics. Building an index of keywords to items, so that when you get a set of search terms you only consider the items containing those terms, will reduce the overall problem. You could also try building a secondary index that maps pairs of keywords to the items that contain them; that would rapidly reduce the size of the problem by limiting the combinations that need to be tested (a sketch of such a pair index follows at the end of this reply). But with 2**(words in the English language + proper nouns) possible keyword sets, you would need a huge database.
If the problem were the usual "Given these keywords, find the (topN) documents that contain the most of them", it wouldn't be so bad, but you appear to want to generate some sort of "best keyword sets", which I find confusing. I hate posts that ask me "What is it that you are ultimately trying to achieve?", but given the sheer scale of the numbers you are talking about, I think this may be a good time to ask such a question. Sorry :)
Examine what is said, not who speaks.
Silence betokens consent.
Love the truth but pardon error.
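Here is a small sketch of the secondary pair index mentioned above (mine, not BrowserUk's code), again using the root node's toy %items; each keyword pair maps straight to the items that contain both:

use strict;
use warnings;

my %items = (   # toy data from the root node
    z => [qw/one six/],
    y => [qw/two three five/],
    x => [qw/one two five/],
);

# Map every keyword pair to the items containing both, built once up front.
my %pair_index;
for my $item (keys %items) {
    my @kw = sort @{ $items{$item} };
    for my $i (0 .. $#kw - 1) {
        for my $j ($i + 1 .. $#kw) {
            push @{ $pair_index{"$kw[$i] $kw[$j]"} }, $item;
        }
    }
}

# Later queries only need to consider items already known to share the pair.
my @hits = @{ $pair_index{'five two'} || [] };
print "items sharing 'five' and 'two': @hits\n";   # x and y, in some order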
Re: algorithm for 'best subsets'
by Roy Johnson (Monsignor) on Mar 03, 2005 at 04:14 UTC
Here's my entry: Read more... (2 kB)
Caution: Contents may have been coded under pressure.
Re: algorithm for 'best subsets'
by Limbic~Region (Chancellor) on Mar 03, 2005 at 03:19 UTC
In between the commercials of Lost and Alias (literally), I came up with the following: Read more... (2 kB)
I think it does what you want but it is quite rough around the edges. I will see about cleaning it up tomorrow. Do you happen to have a bigger sample so that I can benchmark?
Cheers - L~R
Update: Made minor cleanups to code
Re: algorithm for 'best subsets'
by fizbin (Chaplain) on Mar 03, 2005 at 01:32 UTC
I'm a bit curious where this data is coming from.
by fizbin (Chaplain) on Mar 03, 2005 at 19:50 UTC
Unfortunately, it does have a tendency to die of out-of-memory errors if you up either the number of keywords or the average length of a set, and tying %totals to a db file doesn't seem to prevent it, so there must be some other sort of memory leak going on. (I suppose that it could also be the sort exploding, but working around that should be relatively straightforward.)
Re: algorithm for 'best subsets'
by dimar (Curate) on Mar 03, 2005 at 06:42 UTC
Ideas? Does this align with some obscure or obvious algorithm I couldn't recognize?
Hi halley, you may have already seen this one before, but since you asked for an algorithm (either obvious or obscure, but not both?), here is a link to the entry identified as 1.5.1 Clique in The Stony Brook Algorithm Repository. It immediately came to mind upon reading your question. To see how it applies to your scenario, simply picture each 'keyword' as a node in an undirected graph, and each n-tuple combination as a (multi)edge. The clusters with the most connections then represent the 'popular cliques' that you want to identify and single out for special treatment, abuse, retribution, or whatever (just like in high school ;-). Sorry, I've no ready-to-run Perl implementation like those other remarkable chaps, but the link has some useful background for anyone curious.
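There's no Perl in the post above, but the graph formulation is easy to sketch with the Graph module (my illustration, with the root node's toy data; finding the actual maximal cliques is the hard part the Stony Brook link covers):

use strict;
use warnings;
use Graph::Undirected;

my %items = (   # toy data from the root node
    z => [qw/one six/],
    y => [qw/two three five/],
    x => [qw/one two five/],
);

# Keywords are vertices; an edge's weight counts the items containing both endpoints.
my $g = Graph::Undirected->new;
for my $item (keys %items) {
    my @kw = sort @{ $items{$item} };
    for my $i (0 .. $#kw - 1) {
        for my $j ($i + 1 .. $#kw) {
            my $w = $g->get_edge_weight($kw[$i], $kw[$j]) || 0;
            $g->set_edge_weight($kw[$i], $kw[$j], $w + 1);
        }
    }
}

# The heaviest edges are the most promising pairs to grow cliques from.
for my $e (sort { $g->get_edge_weight(@$b) <=> $g->get_edge_weight(@$a) } $g->edges) {
    printf "%-6s %-6s shared by %d item(s)\n", @$e, $g->get_edge_weight(@$e);
}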
=oQDlNWYsBHI5JXZ2VGIulGIlJXYgQkUPxEIlhGdgY2bgMXZ5VGIlhGV
Re: algorithm for 'best subsets'
by tall_man (Parson) on Mar 03, 2005 at 09:01 UTC
Update: Revised to collect the groups much better. I believe it will now do what Roy Johnson suggested. I didn't change any hashes to arrays, though. Timing on my machine for a 676-item test case generated by the benchmark program halley provided:
Update2: I also tried a 17,576-item example (with 'kaaa' .. 'kzzz' and 'iaaa' .. 'izzz'). It ran for one hour to find all groups from 2 up to 4 (the maximum available in this case). The timing is consistent with O(I^2 * log K), where I is the item count and K is the keyword count.
Update3: Inner loop optimization -- better ways to test for empty sets (is_empty) and count bits in sets (Norm). Went from one hour to 54 minutes on the biggest case. Read more... (3 kB)
by Roy Johnson (Monsignor) on Mar 03, 2005 at 16:27 UTC
However, I think you figured combinations that weren't what the OP was looking for. You tell us what the intersection is of each tuple of keys: for example, b and c have two and five in common. I think what the OP wanted was the intersection of a tuple of values: two and five appear together in b, c, e, g, and h. If I can figure out what's what in your code, I will see if it can be made to do that. Also, you can use arrays instead of hashes for revipos and revkpos; then you can use @revipos everywhere you use sort keys %items (because that's all it is).
Caution: Contents may have been coded under pressure.
Re: algorithm for 'best subsets'
by halley (Prior) on Mar 03, 2005 at 15:37 UTC
Update: Here's a useful results format:
by fizbin (Chaplain) on Mar 03, 2005 at 16:36 UTC
Of course, when I did that my method blew through all available memory when trying to compute 5-at-a-time. It did manage 4-at-a-time, though, and in under five minutes.
by halley (Prior) on Mar 03, 2005 at 21:19 UTC
In practice, I can slice the %Items and the %Keywords up a bit, and do smaller overlapping datasets. I can also farm the work out to multiple machines on these slices. More on what that is tonight; I've been stuck in meetings today so haven't had the chance to really explain what all this is about.
by Limbic~Region (Chancellor) on Mar 03, 2005 at 21:27 UTC
by fizbin (Chaplain) on Mar 03, 2005 at 22:57 UTC
by BrowserUk (Patriarch) on Mar 03, 2005 at 22:04 UTC
Re: algorithm for 'best subsets'
by halley (Prior) on Mar 04, 2005 at 02:17 UTC
I'm working on reviving a personal project I started over twenty years ago, back in high school. I like to read and study timelines. That is, graphical maps which give some sort of contextual meaning to a set of events, by their ordering and relative pacing. So, as a quick proof of concept test, I have scraped a century's worth of Wikipedia pages which are organized by date. I make a node for each event that I scrape up. The event's keywords are naively assembled from the words that appear in the one-or-two-sentence summary of the event.
So, now that I have a huge database of events, I'd like to find out the historical context. For example, if I found that [ 'fyodor', 'dostoevsky' ] happens to be found in a useful number of events, I might want to make a sub-line that includes all his events. With a rich enough database, someone reading the resulting timeline might connect his trial to his most recent publications.
It's relatively simple to look for pairs that are always together. The goal is to cover the inevitable holes by looking at constellations of keywords. For example, "Fyodor Dostoevsky" may appear together many times, but should I only map out events that explicitly mention his first name? What if "Dostoevsky" also appears with "author" and "Russian" on a regular basis? Then the database can hint to me that "Fyodor" and "poet" may also be an appropriate association. I can then investigate, hand-tune, and save the interesting queries as important sub-timelines.
What's even more interesting is to show multiple, seemingly unconnected sub-timelines. Fyodor was tried in Russia. Who was the Czar during that period? Who was head of state in Poland and France? Was this before or after Harriet Beecher Stowe's "Uncle Tom's Cabin"?
Given the potentially huge database, and the nature of historical influences, I know that I will have to work with only a few years at a time, or a few words at a time, or both. Solutions which use big memory are going to crash. Solutions which use little memory and can handle larger datasets, even if they run for days, are clearly winners in this application.
I've been quite pleased with the involvement of the community here. I tossed out the question on a whim, and have been too busy today to reply to all your helpful responses as fast as I'd like. I'll definitely be toying with multiple approaches here to find the best balance and data analysis capabilities. An early positive result from my timeline query mechanism (you may need to adjust for wider output):
Eventually, I'll want to approach the Wikimedia folks with some results from their database, but the capability works for any sort of event timeline, from minute-by-minute tracking of space mission procedures, to the astronomic ages which explain how the Earth formed from star stuff.
by tall_man (Parson) on Mar 04, 2005 at 18:04 UTC
The module Graph::UnionFind implements it. Here is an example.
Update: Fixed a bug in the code, and added code to separate out the partitions and count the length of each. Read more... (3 kB)
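tall_man's actual example is behind the 'Read more' link; what follows is my own minimal sketch of the union-find idea, not his code: partition keywords that ever co-occur in an item, using the root node's toy data.

use strict;
use warnings;
use Graph::UnionFind;

my %items = (   # toy data from the root node
    z => [qw/one six/],
    y => [qw/two three five/],
    x => [qw/one two five/],
);

# Union every keyword with the other keywords of each item it appears in.
my $uf = Graph::UnionFind->new;
for my $item (keys %items) {
    my @kw = @{ $items{$item} };
    $uf->add($_) for @kw;
    $uf->union($kw[0], $_) for @kw[1 .. $#kw];
}

# Collect the partitions: keywords grouped under their representative element.
my %partition;
my %all_kw = map { $_ => 1 } map { @$_ } values %items;
push @{ $partition{ $uf->find($_) } }, $_ for sort keys %all_kw;

for my $rep (sort keys %partition) {
    my @members = @{ $partition{$rep} };
    print scalar @members, " keyword(s): @members\n";
}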
by BrowserUk (Patriarch) on Mar 04, 2005 at 19:59 UTC
Could you explain the format of %union_data?
by tall_man (Parson) on Mar 04, 2005 at 20:30 UTC
by BrowserUk (Patriarch) on Mar 04, 2005 at 21:11 UTC
by halley (Prior) on Mar 05, 2005 at 04:16 UTC
I understand Bit::Vector, and making vectors that are the width of the keyword table, and marking the keyword bits for a given item. You don't seem to need the first gang of bit vectors at all; you are marking item bits in item-wide vectors, then never using those vectors. I understand what I read in Graph::UnionFind, though not its internal algorithm, but I might not need to. I think I understand the output it should give. What I don't understand is the way you're trying to combine these methods, and second-guessing Graph::UnionFind's results on each loop. It seems to me that I can use G::UF without bit vectors at all, adding edges between correlated keywords, and then scanning each partition to see which keywords it contains. Can you speak more to your reasoning?
by tall_man (Parson) on Mar 05, 2005 at 15:12 UTC
by halley (Prior) on Mar 05, 2005 at 15:57 UTC
Re: algorithm for 'best subsets'
by BrowserUk (Patriarch) on Mar 05, 2005 at 15:19 UTC
Update: Indeed, the same holds true for 50 other settings of srand.
Halley, I think you will have to improve your dataset generator. Currently, with srand = 12345 and the rand function on my platform, it doesn't produce a single pair of items that share more than one keyword. As you can see below, with the exception of two partitions which contain only a single item each, the maximum number of keywords shared by items in each partition is 1: (Output right-truncated for posting)
Examine what is said, not who speaks.
Silence betokens consent.
Love the truth but pardon error.