Re: algorithm for 'best subsets' by halley (Prior)
on Mar 04, 2005 at 02:17 UTC ( [id://436438]=note )
You ask, "What in the hell has all this keyword data, and why would you want to find the queries with the widest hits?"
I'm working on reviving a personal project I started over twenty years ago, back in high school. I like to read and study timelines: graphical maps that give some sort of contextual meaning to a set of events through their ordering and relative pacing. So, as a quick proof-of-concept test, I have scraped a century's worth of Wikipedia pages which are organized by date. I make a node for each event that I scrape up. The event's keywords are naively assembled from the words that appear in the one-or-two-sentence summary of the event.
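
To make the "naively assembled" keywords concrete, here is a minimal sketch in Python. It is an illustration only, not the scraper itself: the names `keywords`, `make_event`, and `STOPWORDS` are hypothetical, and dropping a few common stopwords is an assumption added here (a purely naive version would keep every word).

```python
import re

# Hypothetical stopword list (an assumption); a real one would be much longer.
STOPWORDS = {"a", "an", "and", "by", "for", "in", "is", "of", "on", "the", "to", "was"}

def keywords(summary):
    """Return the set of lowercased words in a summary, minus stopwords.

    Only runs of ASCII letters are kept, so digits, punctuation, and
    accented characters are dropped by this naive tokenizer.
    """
    words = re.findall(r"[a-z]+", summary.lower())
    return {w for w in words if w not in STOPWORDS}

def make_event(date, summary):
    """Build one event node: a date label, the summary text, and its keyword set."""
    return {"date": date, "summary": summary, "keywords": keywords(summary)}
```

Keeping the keywords as a set per event means that a query such as [ 'fyodor', 'dostoevsky' ] reduces to a subset test against each node.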
So, now that I have a huge database of events, I'd like to find out the historical context. For example, if I found that [ 'fyodor', 'dostoevsky' ] happens to be found in a useful number of events, I might want to make a sub-line that includes all his events. With a rich enough database, someone reading the resulting timeline might connect his trial to his most recent publications.
It's relatively simple to look for pairs that are always together. The goal is to cover the inevitable holes by looking at constellations of keywords. For example, "Fyodor Dostoevsky" may appear together many times, but should I only map out events that explicitly mention his first name? What if "Dostoevsky" also appears with "author" and "Russian" on a regular basis? Then the database can hint to me that "Fyodor" and "poet" may also be an appropriate association. I can then investigate, hand-tune, and save the interesting queries as important sub-timelines.
What's even more interesting is to show multiple, seemingly unconnected sub-timelines. Fyodor was tried in Russia. Who was the Czar during that period? Who was head of state in Poland and France? Was this before or after Harriet Beecher Stowe's "Uncle Tom's Cabin"?
Given the potentially huge database, and the nature of historical influences, I know that I will have to work with only a few years at a time, or a few words at a time, or both. Solutions which use big memory are going to crash. Solutions which use little memory and can handle larger datasets are clearly the winners in this application, even if they run for days.
I've been quite pleased with the involvement of the community here. I tossed out the question on a whim, and have been too busy today to reply to all your helpful responses as fast as I'd like. I'll definitely be toying with multiple approaches here to find the best balances and data analysis capabilities.
An early positive result from my timeline query mechanism (you may need to adjust for wider output):
Eventually, I'll want to approach the Wikimedia folks with some results from their database, but the capability works for any sort of event timeline, from minute-by-minute tracking of space mission procedures to the astronomical ages which explain how the Earth formed from star stuff.
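
Coming back to the keyword constellations (which words regularly keep company with "dostoevsky"?), the idea can be sketched as one streaming pass over the events. This is a rough sketch in Python under stated assumptions, not the query mechanism mentioned above: the function name `cooccurring` and the `min_share` threshold are hypothetical, and each event is assumed to be available as a set of keywords.

```python
from collections import Counter

def cooccurring(events, seed, min_share=0.5):
    """Report keywords that accompany `seed` in at least `min_share` of its events.

    `events` is any iterable of keyword sets, so it can be a generator that
    reads one event at a time from disk. Memory use is bounded by the number
    of distinct words seen alongside the seed, not by the number of events.
    """
    hits = 0
    counts = Counter()
    for kw in events:
        if seed in kw:
            hits += 1
            counts.update(kw - {seed})  # count every companion word once per event
    if hits == 0:
        return []
    return [(word, n) for word, n in counts.most_common() if n / hits >= min_share]
```

The trade-off matches the constraint stated earlier: one slow pass per seed word and a small footprint, instead of holding a full pairwise table in memory. Restricting `events` to a few years at a time shrinks the counts further.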