Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

Re^4: algorithm for 'best subsets'

by tall_man (Parson)
on Mar 04, 2005 at 20:30 UTC ( [id://436737]=note: print w/replies, xml ) Need Help??


in reply to Re^3: algorithm for 'best subsets'
in thread algorithm for 'best subsets'

There were bugs in my earlier code, but now the intended meaning of %union_data is actually fulfilled.

Given two numeric keys for vertices in a graph, the UnionFind "union" operation creates an edge joining them and returns one of the two keys as a master key to the joined vertex set. What I am doing with %union_data is supplementing the algorithm -- under each master key, there is a bit-vector with the union of all bits that are true in any member of the joined set.

In my implementation, the vertices are item numbers, so the key to %union_data is a subset of the item numbers. The binary values are bit vectors: bit [k] is true if keyword[k] is in the set.

This allows me to extend the partition easily by testing each new item against the joined sets that have already been built.

Replies are listed 'Best First'.
Re^5: algorithm for 'best subsets'
by BrowserUk (Patriarch) on Mar 04, 2005 at 21:11 UTC

    Update: Okay. I hadn't seen your updated code when I wrote this. One question though. Couldn't you just accumulate the item numbers involved in each partition as you go, rather than rediscovering them afterwards?

    Then I still don't understand what the resultant dataset in %union_data is?

    The keys of the hash are a subset of the items.

    But how does one item number represent a partition of the total items?

    The values are a bitmap representing a set of keywords.

    I think I understand that the value bitmaps represent an inclusive OR (union) of all the keywords found in a given partition, that have no intersection with any of the keywords in any other of the partitions? Is that correct?

    But there is no quick way(?) to determine which items are in each partition As each 'item number' key represents it's partition, not identifies it.

    And each value (keyword bitmap) is also composite, so there is no way back to the individual item/keywords sets(?) in order to do the n-ary unions, there either.

    Don't get me wrong, this is a quicker approach than the one I was persuing--reducing the n-ary unions by excluding those that didn't share at least n keywords in a pairwise union--but I'm struggling to see how you go forward from where your code leaves off?


    Examine what is said, not who speaks.
    Silence betokens consent.
    Love the truth but pardon error.
      It wouldn't be easy to accumulate the item numbers, because the data structure for UnionFind keeps changing as I go. At any given point, I can find out to which partition a vertex belongs to by looking it up with "find". So the easiest way to get the partitions is to gather them up at the end.

      However, I am still seeing a bug. Some of my "one-item" partitions seem to have shared keys with other partitions. Cases near the beginning of the item set (like "iaac") tend to have this problem. Trying to gather the items into partitions as I go along might help me find the bug.

      The purpose of this pass is to find all the completely distinct partitions. There's no point in looking for n-ary unions among things that have no bits in common. So I would apply the original algorithm to each subset.

      This pass will also help halley decide if the keywords are too generic. If it's all one big clump, there may be too many common words in the set.

        It's definitely a very valuable pass to make. It cuts down the combinatorics immensely, especially (as you said) if you can remove the most common words from the picture.

        I implemented it myself, because I couldn't work out how to accumulate the partiton items on the fly using Graph::UnionFind--for good reason it seems:).

        My homebrew implementation finds

      • 141 partitons in the 17576 keywords / 676 items set in .5 seconds.
      • 292 partitons in the 436697 keywords / 676 items set in 23 seconds.
      • 745 partitions in the 17676 keywords / 17676 items set in 18 seconds.
        [22:56:47.23] P:\test>436050-2 -W=3 -I=2 -NODETAILS Keywords: 16873 Items: 676 141 partitons found. [22:56:47.81] P:\test>436050-2 -W=4 -I=2 -NODETAILS Keywords: 438697 Items: 676 292 partitons found. [22:57:10.23] P:\test>436050-2 -W=3 -I=3 -NODETAILS Keywords: 16873 Items: 17576 745 partitons found. [22:57:28.48] P:\test>

        Which is good news because it means that it scales well in both directions.

        I tried the 438897K / 438697I combination, but even with avoiding the memory requirement of Gr::UnF's hashes, it still requires more memory than I have, and my disk is badly fragged so I am defragging it before trying again.

        Once you have partitioned, would you agree that it makes sense to do a pairwise combinations of the partitions and isolating those pairs within the partition that have nothing in common?


        Examine what is said, not who speaks.
        Silence betokens consent.
        Love the truth but pardon error.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://436737]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chilling in the Monastery: (4)
As of 2024-04-20 15:00 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found