Re: algorithm for 'best subsets'

by halley (Prior)
on Mar 04, 2005 at 02:17 UTC ( [id://436438] )


in reply to algorithm for 'best subsets'

You ask: "What in the hell has all this keyword data, and why would you want to find the queries with the widest hits?"

I'm working on reviving a personal project I started over twenty years ago, back in high school. I like to read and study timelines: graphical maps which give some sort of contextual meaning to a set of events through their ordering and relative pacing.

So, as a quick proof-of-concept test, I have scraped a century's worth of Wikipedia pages which are organized by date. I make a node for each event that I scrape up. The event's keywords are naively assembled from the words that appear in the one- or two-sentence summary of the event.

{
    id    => 6637,
    title => 'A Russian court sentences Fyodor Dostoevsky to death '
           . 'for anti-government activities linked to a radical '
           . 'intellectual group, but his execution is canceled at '
           . 'the last minute',
    epoch => 1,
    datum => '16-11-1849 AD',
    point => 1849.87440109514,
    kword => [ 'activities', 'anti', 'canceled', 'court', 'death',
               'dostoevsky', 'execution', 'fyodor', 'government',
               'group', 'intellectual', 'last', 'linked', 'minute',
               'radical', 'russian', 'sentences' ],
}

So, now that I have a huge database of events, I'd like to find out the historical context. For example, if I found that [ 'fyodor', 'dostoevsky' ] happens to be found in a useful number of events, I might want to make a sub-line that includes all his events. With a rich enough database, someone reading the resulting timeline might connect his trial to his most recent publications.

It's relatively simple to look for pairs that are always together. The goal is to cover the inevitable holes by looking at constellations of keywords. For example, "Fyodor Dostoevsky" may appear together many times, but should I only map out events that explicitly mention his first name? What if "Dostoevsky" also appears with "author" and "Russian" on a regular basis? Then the database can hint to me that "Fyodor" and "poet" may also be an appropriate association. I can then investigate, hand-tune, and save the interesting queries as important sub-timelines.

What's even more interesting is to show multiple, seemingly unconnected sub-timelines. Fyodor was tried in Russia. Who was the Czar during that period? Who was head of state in Poland and France? Was this before or after Harriet Beecher Stowe's "Uncle Tom's Cabin"?

Given the potentially huge database, and the nature of historical influences, I know that I will have to work with only a few years at a time, or a few words at a time, or both. Solutions which use big memory are going to crash. Solutions which use little memory and can handle larger datasets, even if they run for days, are clearly the winners in this application.

I've been quite pleased with the involvement of the community here. I tossed out the question on a whim, and have been too busy today to reply to all your helpful responses as fast as I'd like. I'll definitely be toying with multiple approaches to find the best balance of performance and analysis capability.

An early positive result from my timeline query mechanism (you may need to adjust for wider output):

Seeking for '+ford automobile model'... found 6 matches.
Seeking for '+wright brothers orville wilbur'... found 2 matches.
Seeking for 'boer +boers'... found 3 matches.
Laying out 3 queries, and 11 events.  Need a maximum of 2 lines.

11-10-1899 AD  ^    Boer War begins: In South Africa, a war between the
                    United Kingdom and the Boers of the Transvaal and
                    Orange Free State erupts
23-02-1900 AD  |    Boer War: Battle of Hart's Hill - In South Africa
                    the Boers and British troops battle
10-03-1902 AD  v    Boer War: South African Boers win their last battle
                    over British forces, with the capture of a British
                    general and 200 of his men
23-07-1903 AD  ^    Dr. Ernst Pfenning of Chicago, Illinois becomes the
                    first owner of a Ford Model A
17-12-1903 AD  ^ |  Orville Wright flies aircraft with a petrol engine
                    in first documented successful controlled powered
                    heavier-than-air flight
07-11-1910 AD  v |  First air flight for the purpose of delivering
                    commercial freight occurs between Dayton, Ohio and
                    Columbus, Ohio by the Wright Brothers and department
                    store owner Max Moorehouse
03-11-1911 AD  *    Chevrolet officially enters the automobile market
                    to compete with the Ford Model T
27-05-1927 AD  *    Ford Motor Company ceases manufacturing Ford Model
                    Ts and begins to retool plants to make Ford Model As
02-12-1927 AD  *    Following 19 years of Ford Model T production, the
                    Ford Motor Company unveils the Ford Model A as its
                    new automobile
13-01-1942 AD  *    Henry Ford patents a plastic automobile, which is
                    30% lighter than a regular car
17-02-1972 AD  v    Sales of the Volkswagen Beetle model exceed those
                    of Ford Model-T (15 million)

Eventually, I'll want to approach the Wikimedia folks with some results from their database, but the capability works for any sort of event timeline, from minute-by-minute tracking of space mission procedures, to the astronomic ages which explain how the Earth formed from star stuff.

--
[ e d @ h a l l e y . c c ]

Replies are listed 'Best First'.
Re^2: algorithm for 'best subsets'
by tall_man (Parson) on Mar 04, 2005 at 18:04 UTC
    A divide-and-conquer approach may help you. There is a very fast, nearly linear-time algorithm called union-find which will divide the problem into groups of items that have no keywords in common. Then each partition could be treated separately.

    The module Graph::UnionFind implements it. Here is an example. On my machine it originally found 153 partitions of the 17,576-node problem in a minute and 26 seconds; with the fix noted in the update below, it finds 129 partitions in a minute and 14 seconds.

    Update: Fixed a bug in the code, and added code to separate out the partitions and count the length of each.
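
    One simple way to get such a partitioning out of Graph::UnionFind (a rough, untested sketch on a tiny made-up data set, not the bit-vector code referred to above) is to join each item to a vertex for each of its keywords:

    use strict;
    use warnings;
    use Graph::UnionFind;

    # Tiny made-up data set: item id => list of keywords.
    my %items = (
        6637 => [ qw(dostoevsky russian court death sentences) ],
        6638 => [ qw(dostoevsky author novel) ],
        7001 => [ qw(ford automobile model) ],
    );

    my $uf = Graph::UnionFind->new;

    # Join each item to a vertex for each of its keywords.  Items that
    # share a keyword, directly or through a chain of other items, end
    # up in the same partition.  This is nearly linear in the total
    # number of (item, keyword) pairs.
    for my $id (keys %items) {
        $uf->add("item:$id");
        for my $kw (@{ $items{$id} }) {
            $uf->add("kw:$kw");
            $uf->union("item:$id", "kw:$kw");
        }
    }

    # Collect the items of each partition under its representative.
    my %partition;
    push @{ $partition{ $uf->find("item:$_") } }, $_ for keys %items;

    printf "%d partitions\n", scalar keys %partition;
    print "  @$_\n" for values %partition;

    Each partition can then be attacked separately by the more expensive subset search.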

      Could you explain the format of %union_data?

    • What do the numeric keys represent? Item numbers?
    • What do the bitstring values represent?

      Examine what is said, not who speaks.
      Silence betokens consent.
      Love the truth but pardon error.
        There were bugs in my earlier code, but now the intended meaning of %union_data is actually fulfilled.

        Given two numeric keys for vertices in a graph, the UnionFind "union" operation creates an edge joining them and returns one of the two keys as a master key to the joined vertex set. What I am doing with %union_data is supplementing the algorithm -- under each master key, there is a bit-vector with the union of all bits that are true in any member of the joined set.

        In my implementation, the vertices are item numbers, so the keys of %union_data are a subset of the item numbers. The values are bit vectors: bit [k] is true if keyword [k] is in the joined set.

        This allows me to extend the partition easily by testing each new item against the joined sets that have already been built.
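
        Roughly, the bookkeeping looks like this (an untested sketch on made-up data, not my exact code):

        use strict;
        use warnings;
        use Graph::UnionFind;
        use Bit::Vector;

        # Made-up data: a keyword table and each item's keywords.
        my @keywords = qw(dostoevsky russian court ford automobile model);
        my %kw_index = map { $keywords[$_] => $_ } 0 .. $#keywords;
        my %item_kws = (
            1 => [ qw(dostoevsky russian court) ],
            2 => [ qw(dostoevsky russian) ],
            3 => [ qw(ford automobile model) ],
        );

        my $uf = Graph::UnionFind->new;
        my %union_data;   # master key (from find) => combined bit vector

        for my $item (sort { $a <=> $b } keys %item_kws) {
            $uf->add($item);

            # Bit vector of this item's keywords.
            my $bits = Bit::Vector->new(scalar @keywords);
            $bits->Bit_On($kw_index{$_}) for @{ $item_kws{$item} };

            # Test the new item against each existing joined set.
            for my $master (keys %union_data) {
                my $overlap = Bit::Vector->new(scalar @keywords);
                $overlap->Intersection($bits, $union_data{$master});
                next unless $overlap->Norm;      # no keyword in common

                $uf->union($item, $master);
                $bits->Union($bits, $union_data{$master});
                delete $union_data{$master};     # key may go stale after union
            }

            # Re-key the merged set under its current master key.
            $union_data{ $uf->find($item) } = $bits;
        }

        printf "%d partitions\n", scalar keys %union_data;

        Each new item is compared against one combined vector per existing joined set, rather than against every other item.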

      I've spent a good couple of hours this evening trying to figure out the intent of this code and to run it against my own data. I'm not there yet.

      I understand Bit::Vector, and making vectors that are the width of the keyword table, and marking the keyword bits for a given item.

      You don't seem to need the first gang of bit vectors at all; you are marking item bits in item-wide vectors, then never using those vectors.

      I understand what I read in the Graph::UnionFind documentation, though not its internal algorithm; I might not need to. I think I understand the output it should give.

      What I don't understand is the way you're trying to combine these methods, and why you second-guess Graph::UnionFind's results on each loop.

      It seems to me that I can use G::UF without bit vectors at all, adding edges between correlated keywords, and then scanning each partition for the keywords it contains, roughly as in the sketch below.
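
      An untested sketch of what I mean, on made-up data:

      use strict;
      use warnings;
      use Graph::UnionFind;

      # Made-up data: item id => keywords.
      my %items = (
          1 => [ qw(dostoevsky russian court) ],
          2 => [ qw(dostoevsky author) ],
          3 => [ qw(ford automobile model) ],
      );

      my $uf = Graph::UnionFind->new;

      # Union the keywords that co-occur in an item.  Chaining each
      # keyword to the first one is enough; no need for all pairs.
      for my $id (keys %items) {
          my ($first, @rest) = @{ $items{$id} };
          $uf->add($first);
          for my $kw (@rest) {
              $uf->add($kw);
              $uf->union($first, $kw);
          }
      }

      # Scan: which keywords ended up in each partition?
      my (%seen, %kw_in_partition);
      for my $kw (grep { !$seen{$_}++ } map { @$_ } values %items) {
          push @{ $kw_in_partition{ $uf->find($kw) } }, $kw;
      }

      # An item belongs to the partition of any one of its keywords.
      my %items_in_partition;
      push @{ $items_in_partition{ $uf->find($items{$_}[0]) } }, $_
          for keys %items;

      print "keywords: @$_\n" for values %kw_in_partition;
      print "items:    @$_\n" for values %items_in_partition;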

      Can you speak more to your reasoning?

      --
      [ e d @ h a l l e y . c c ]

        The first set of bit vectors is not used in the code I posted, but it is used in another version I was working on. If I operate on those vectors in the same way I do the second set, I should get a complete list of the items that were combined into each partition. That is useful for testing and for using the partitions later.

        I suppose you could use Graph::UnionFind by itself, but it would be horrendously inefficient. You would have to compare each pair of items to see if they intersected -- O(N^2). With my method (once I get it working right) I only have to test against the combined bit-set of each partition -- O(N*P) where P is the average number of partitions.

        If you really want to scale to tens of thousands of items, I think this is the right approach. However, trying to work from the outside of Graph::UnionFind was probably a mistake. I need to create a modified version that stores the bit-vectors as part of the internal structure. Then, as it re-organizes itself, it would be able to keep the information accurately.

        UnionFind uses a forest of n-ary trees which have only parent pointers. You can easily go from a child to the root, but not vice-versa. It can also "flatten" the tree by making more of the pointers go directly to the top. I believe the flattening is what is causing me to lose information.
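
        For illustration, a bare-bones union-find with path compression looks roughly like this (not Graph::UnionFind's actual internals):

        # Bare-bones union-find with path compression, for illustration
        # only; Graph::UnionFind's real internals differ.
        my %parent;   # vertex => parent vertex; a root is its own parent

        sub uf_find {
            my ($v) = @_;
            $parent{$v} = $v unless exists $parent{$v};
            my $root = $v;
            $root = $parent{$root} while $parent{$root} ne $root;
            # Flatten: point every vertex on the path straight at the
            # root.  This rewrites intermediate parent pointers, which
            # is why state keyed on old parents can silently go stale.
            while ($parent{$v} ne $root) {
                my $next = $parent{$v};
                $parent{$v} = $root;
                $v = $next;
            }
            return $root;
        }

        sub uf_union {
            my ($u, $v) = @_;
            my ($ru, $rv) = (uf_find($u), uf_find($v));
            $parent{$rv} = $ru unless $ru eq $rv;   # one root adopts the other
            return $ru;
        }

        A variant that also kept a combined keyword bit vector at each root, merging it inside uf_union, would keep the per-partition data consistent automatically.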

        There are many references to the "union find" algorithm, some with animated Java applets. Here is one.
