You ask, What in the hell has all this keyword data, and why would you want to find the queries with the widest hits?

I'm working on reviving a personal project I started over twenty years ago, back in high school. I like to read and study timelines. That is, graphical maps which give some sort of contextual meaning to a set of events, by their ordering and relative pacing.

So, as a quick proof of concept test, I have scraped a century's worth of Wikipedia pages which are organized by date. I make a node for each event that I scrape up. The event's keywords are naively assembled from the words that appear in the one-or-two-sentence summary of the event.

    {
        id    => 6637,
        title => 'A Russian court sentences Fyodor Dostoevsky \
                  to death for anti-government activities linked \
                  to a radical intellectual group, but his \
                  execution is canceled at the last minute',
        epoch => 1,
        datum => '16-11-1849 AD',
        point => 1849.87440109514,
        kword => [
            'activities', 'anti', 'canceled', 'court', 'death',
            'dostoevsky', 'execution', 'fyodor', 'government', 'group',
            'intellectual', 'last', 'linked', 'minute', 'radical',
            'russian', 'sentences'
        ],
    }
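
A minimal sketch of how such a naive keyword list could be built from an event summary. The helper name `keywords_from_title` and the stop-word list are assumptions made for illustration, not the code actually used for the scrape; the stop list was simply chosen so that the sample record above comes out the same.

```perl
use strict;
use warnings;

# ASSUMED stop list: short function words to leave out of the keywords.
my @STOP = qw(a an and at but by for his in is of on the to);

# Hypothetical helper: lowercase every run of letters in the summary,
# drop duplicates and stop words, and return the rest sorted.
sub keywords_from_title {
    my ($title) = @_;
    my %seen;
    $seen{ lc $_ }++ for $title =~ /[A-Za-z]+/g;
    delete @seen{@STOP};
    return [ sort keys %seen ];
}

my $kw = keywords_from_title(
    'Orville Wright flies aircraft with a petrol engine');
print join(' ', @$kw), "\n";
```

Splitting on letters alone is deliberately crude: "anti-government" becomes two keywords, and a possessive such as "Hart's" leaves a stray "s" behind.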

So, now that I have a huge database of events, I'd like to find out the historical context. For example, if I found that [ 'fyodor', 'dostoevsky' ] happens to be found in a useful number of events, I might want to make a sub-line that includes all his events. With a rich enough database, someone reading the resulting timeline might connect his trial to his most recent publications.

It's relatively simple to look for pairs that are always together. The goal is to cover the inevitable holes by looking at constellations of keywords. For example, "Fyodor Dostoevsky" may appear together many times, but should I only map out events that explicitly mention his first name? What if "Dostoevsky" also appears with "author" and "Russian" on a regular basis? Then the database can hint to me that "Fyodor" and "poet" may also be an appropriate association. I can then investigate, hand-tune, and save the interesting queries as important sub-timelines.
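
As a rough illustration of the pair counting described above, the sketch below tallies how many events contain each unordered keyword pair and then lists the companions of one keyword. The function names and the three tiny events are made up for the example, and each `kword` list is assumed to be free of duplicates.

```perl
use strict;
use warnings;

# Count, for every unordered keyword pair, how many events contain both.
sub pair_counts {
    my (@events) = @_;
    my %count;
    for my $event (@events) {
        my @kw = sort @{ $event->{kword} };
        for my $i (0 .. $#kw - 1) {
            for my $j ($i + 1 .. $#kw) {
                $count{"$kw[$i] $kw[$j]"}++;
            }
        }
    }
    return \%count;
}

# List the keywords seen alongside $word, most frequent first.
sub companions {
    my ($word, $count) = @_;
    my %with;
    for my $pair (keys %$count) {
        my ($x, $y) = split ' ', $pair;
        $with{$y} = $count->{$pair} if $x eq $word;
        $with{$x} = $count->{$pair} if $y eq $word;
    }
    return [ sort { $with{$b} <=> $with{$a} || $a cmp $b } keys %with ];
}

# Made-up events, only to exercise the functions.
my @events = (
    { kword => [qw(dostoevsky fyodor russian)] },
    { kword => [qw(author dostoevsky russian)] },
    { kword => [qw(author dostoevsky novel)] },
);
my $count = pair_counts(@events);
print join(' ', @{ companions('dostoevsky', $count) }), "\n";
```

The pair table grows with the square of the number of keywords per event, so on a large database it would only be practical over a restricted slice of years or a restricted vocabulary.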

What's even more interesting is to show multiple, seemingly unconnected sub-timelines. Fyodor was tried in Russia. Who was the Czar during that period? Who was head of state in Poland and France? Was this before or after Harriet Beecher Stowe's "Uncle Tom's Cabin"?

Given the potentially huge database, and the nature of historical influences, I know that I will have to work with only a few years at a time, or a few words at a time, or both. Solutions which use big memory are going to crash. Solutions which use little memory and can handle larger datasets are clearly the winners in this application, even if they run for days.
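
One way to keep memory small is to stream the events and tally only those inside a window of years. The sketch below assumes a one-event-per-line text format, `<point><TAB><kw1> <kw2> ...`, which is an invention for this example and not the storage actually used; `keyword_tally` is likewise a hypothetical name.

```perl
use strict;
use warnings;

# Tally keyword frequencies for events whose 'point' lies in [$from, $to).
# Events are read one line at a time, so only the tallies stay in memory.
# ASSUMED line format: "<point>\t<kw1> <kw2> ..."
sub keyword_tally {
    my ($fh, $from, $to) = @_;
    my %tally;
    while (my $line = <$fh>) {
        chomp $line;
        my ($point, $words) = split /\t/, $line, 2;
        next unless defined $words;
        next if $point < $from || $point >= $to;
        $tally{$_}++ for split ' ', $words;
    }
    return \%tally;
}

# Made-up sample data, read through an in-memory filehandle.
my $data = "1849.87\tcourt dostoevsky russian\n"
         . "1855.10\tcourt russian\n"
         . "1903.96\torville wright\n";
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
my $tally = keyword_tally($fh, 1840, 1860);
close $fh;
print "$_ $tally->{$_}\n" for sort keys %$tally;
```

Memory use then depends on the vocabulary inside the window rather than on the size of the whole database, at the cost of rereading the file for each window.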

I've been quite pleased with the involvement of the community here. I tossed out the question on a whim, and have been too busy today to reply to all your helpful responses as fast as I'd like. I'll definitely be toying with multiple approaches here to find the best balances and data analysis capabilities.

An early positive result from my timeline query mechanism (you may need to adjust for wider output):

    Seeking for '+ford automobile model'... found 6 matches.
    Seeking for '+wright brothers orville wilbur'... found 2 matches.
    Seeking for 'boer +boers'... found 3 matches.
    Laying out 3 queries, and 11 events.
    Need a maximum of 2 lines.

    11-10-1899 AD  ^    Boer War begins: In South Africa, a war between the
                   |    United Kingdom and the Boers of the Transvaal and Orange
                   |    Free State erupts
                   |
    23-02-1900 AD  |    Boer War: Battle of Hart's Hill - In South Africa the
                   |    Boers and British troops battle
                   |
    10-03-1902 AD  v    Boer War: South African Boers win their last battle over British forces, with the capture of a British general and 200 of his men

    23-07-1903 AD    ^  Dr. Ernst Pfenning of Chicago, Illinois becomes the first
                     |  owner of a Ford Model A
                     |
    17-12-1903 AD  ^ |  Orville Wright flies aircraft with a petrol engine in
                   | |  first documented successful controlled powered
                   | |  heavier-than-air flight
                   | |
    07-11-1910 AD  v |  First air flight for the purpose of delivering commercial
                     |  freight occurs between Dayton, Ohio and Columbus, Ohio by
                     |  the Wright Brothers and department store owner Max
                     |  Moorehouse
                     |
    03-11-1911 AD    *  Chevrolet officially enters the automobile market to
                     |  compete with the Ford Model T
                     |
    27-05-1927 AD    *  Ford Motor Company ceases manufacturing Ford Model Ts and
                     |  begins to retool plants to make Ford Model As
                     |
    02-12-1927 AD    *  Following 19 years of Ford Model T production, the Ford
                     |  Motor Company unveils the Ford Model A as its new
                     |  automobile
                     |
    13-01-1942 AD    *  Henry Ford patents a plastic automobile, which is 30%
                     |  lighter than a regular car
                     |
    17-02-1972 AD    v  Sales of the Volkswagen Beetle model exceed those of Ford Model-T (15 million)
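
The "Need a maximum of 2 lines" figure can be reproduced with a greedy interval-partitioning pass, sketched below. This is a standard technique and not necessarily what the query mechanism does; `assign_lanes` is a hypothetical name, the start and end values are approximate fractional years worked out from the dates in the output, and the lane each query lands in may differ from the columns printed above.

```perl
use strict;
use warnings;

# Greedy interval partitioning: take spans in order of start point and
# give each the lowest-numbered lane whose previous span has ended.
sub assign_lanes {
    my (@spans) = @_;    # each: { name => ..., start => ..., end => ... }
    my @lane_end;        # end point of the last span placed in each lane
    my %lane_of;
    for my $span (sort { $a->{start} <=> $b->{start} } @spans) {
        my $lane = 0;
        $lane++ while $lane < @lane_end
                   && $lane_end[$lane] >= $span->{start};
        $lane_end[$lane] = $span->{end};
        $lane_of{ $span->{name} } = $lane;
    }
    return (\%lane_of, scalar @lane_end);
}

# Approximate extents of the three queries shown in the output.
my ($lane_of, $lanes) = assign_lanes(
    { name => 'ford',   start => 1903.56, end => 1972.13 },
    { name => 'wright', start => 1903.96, end => 1910.85 },
    { name => 'boer',   start => 1899.78, end => 1902.19 },
);
print "Need a maximum of $lanes lines.\n";
```

Because the Boer War events end before the Ford events begin, those two queries can share a lane, and only the overlapping Wright events force a second one.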

Eventually, I'll want to approach the Wikimedia folks with some results from their database, but the capability works for any sort of event timeline, from minute-by-minute tracking of space mission procedures, to the astronomic ages which explain how the Earth formed from star stuff.

--
[ e d @ h a l l e y . c c ]


In reply to Re: algorithm for 'best subsets' by halley
in thread algorithm for 'best subsets' by halley
