PerlMonks  

Re: Refining a 'vector space search'.

by Anonymous Monk
on Jun 29, 2003 at 07:19 UTC ( [id://269971] )


in reply to Refining a 'vector space search'.

There is an idea that bounces around the data-mining community which goes like this: the most important words are repeated only a few times in a document. Take an article about something specific, like unit testing in Perl, and see how many times the words "unit", "test", and "perl" appear in the text as opposed to the words "code", "module", "method", "error", etc. This sorta-kinda-maybe justifies the following simplification: don't count the number of times a word appears in a document, just record whether it's there or not. Now we can use a bit vector instead of a "real" vector, and sparse bit vectors have a wonderfully small representation: a list of all the indexes that are 1. While I imagine your set of known words will be quite big, odds are the set of words in each document will be small, so this representation will save you a lot of space.
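As a sketch of that representation (the vocabulary hash and the tokenizer here are my own illustrative assumptions, not anything from your code):

```perl
use strict;
use warnings;

# Hypothetical vocabulary: word => index. In practice you'd grow this
# as new words are seen.
my %vocab = (unit => 0, test => 1, perl => 2, code => 3, module => 4);

# Turn a document into a sorted list of vocabulary indexes -- the
# sparse representation of its bit vector. Each index appears at most
# once, no matter how often the word occurs in the text.
sub doc_to_sparse {
    my ($text, $vocab) = @_;
    my %seen;
    for my $word (split /\W+/, lc $text) {
        $seen{ $vocab->{$word} } = 1 if exists $vocab->{$word};
    }
    return [ sort { $a <=> $b } keys %seen ];
}

my $vec = doc_to_sparse("Unit test the Perl module; test the module again", \%vocab);
# $vec is [0, 1, 2, 4] -- "test" and "module" are recorded only once
```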

Anyway, decide what the best in-memory representation is for your data, then I suggest you use that. This is a very specific application and one which, IMHO, doesn't map well onto the entity-relationship view RDBMSs have of the world. If your application can handle looong start-up times (most can), you could just use Data::Dumper to occasionally "snapshot" the data, or maybe a prevalence-style solution (keep everything in memory, periodically serialize to disk) would be a good idea.
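A minimal snapshot/restore sketch along those lines (the file name and the doc-id => index-list layout are placeholders of mine):

```perl
use strict;
use warnings;
use Data::Dumper;

# Assumed in-memory store: doc id => sorted index list.
my %docs = (
    1 => [0, 1, 2],
    2 => [1, 4],
);

# Snapshot: dump the structure as evaluable Perl code.
sub snapshot {
    my ($file, $data) = @_;
    open my $fh, '>', $file or die "can't write $file: $!";
    local $Data::Dumper::Purity = 1;
    print {$fh} Data::Dumper->Dump([$data], ['docs']);
    close $fh;
}

# Restore at start-up: evaluate the dump back into a hashref.
# Note the explicit "./" -- a bare relative name would be looked up
# through @INC by `do`, and recent perls no longer keep "." in @INC.
sub restore {
    my ($file) = @_;
    my $docs = do $file;
    die "can't restore from $file: $@ $!" unless $docs;
    return $docs;
}

snapshot('./vector_snapshot.pl', \%docs);
my $restored = restore('./vector_snapshot.pl');
unlink './vector_snapshot.pl';    # clean up the demo file
```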

The question is: what do you want to know about that vector? You're not really interested in the vector's position in some n-dimensional space; all you care about is its distance from the other vectors in that space. Well, how do you calculate that distance? Odds are you'll do what the article says and take the cosine distance, so what info do we need about a document in order to take the cosine distance? We need the magnitudes of both vectors and we need their inner product. The magnitude we can pre-calculate (it's the square root of the sum of the squared elements; we can skip the squaring since all our elements are either one or zero, and while we can't skip the square root, we only take it once, when we insert a new document, so that's not such a big deal). How does the inner product work? It's sum_{i=0}^{n} a_i * b_i, which in our case (we're using sparse vectors here) is just the size of the intersection of the index lists of a and b, and since those values were inserted in order (got that? keep the list of indexes sorted) this is a quick scan through the index lists of a and b.
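That computation can be sketched like this (function names are mine; the merge scan relies on both index lists being sorted, as above):

```perl
use strict;
use warnings;

# Size of the intersection of two sorted index lists (merge scan).
sub intersection_size {
    my ($x, $y) = @_;
    my ($i, $j, $n) = (0, 0, 0);
    while ($i < @$x && $j < @$y) {
        if    ($x->[$i] < $y->[$j]) { $i++ }
        elsif ($x->[$i] > $y->[$j]) { $j++ }
        else                        { $n++; $i++; $j++ }
    }
    return $n;
}

# Cosine similarity of two binary sparse vectors. The magnitude of a
# 0/1 vector is just sqrt(number of ones), i.e. sqrt of the list length:
#   cos(a, b) = |a intersect b| / (sqrt(|a|) * sqrt(|b|))
sub cosine {
    my ($x, $y) = @_;
    return 0 unless @$x && @$y;
    return intersection_size($x, $y) / (sqrt(scalar @$x) * sqrt(scalar @$y));
}

my $sim = cosine([0, 1, 2, 4], [1, 2, 7]);   # 2 / (sqrt(4) * sqrt(3))
```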

The last thing we'd like to touch on is the fact that for every query we have to check it against each and every other document. You've got two solutions to this: 1) query caching (I'm not going to go into this any further), and 2) vector space partitioning. Odds are your documents will be distributed not uniformly but in various sets of similar documents (clusters). Clustering algorithms are still being researched, so there's no real "answer", but the idea is this: every now and then (vague definition, I know) you go through the db and partition it into sets of documents which aren't orthogonal to each other (i.e., sets of documents that share words). Then when you get a query you can limit your search to only the documents which can have a cosine distance > 0 with it. Sorry if this last bit isn't as clear as I'd like.
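One cheap stand-in for full clustering, just to show the "skip the orthogonal documents" idea: an inverted index from word index to doc ids. Only documents sharing at least one index with the query can have cosine > 0. (The data layout here is my own assumption, matching the sketches above.)

```perl
use strict;
use warnings;

# Sparse vectors per document: doc id => sorted index list.
my %docs = (
    1 => [0, 1, 2],
    2 => [3, 4],
    3 => [1, 5],
);

# Inverted index: word index => list of doc ids containing that word.
my %postings;
for my $id (keys %docs) {
    push @{ $postings{$_} }, $id for @{ $docs{$id} };
}

# Candidate documents for a query vector: anything not returned here
# is orthogonal to the query and can be skipped entirely.
sub candidates {
    my ($query) = @_;
    my %cand;
    for my $idx (@$query) {
        $cand{$_} = 1 for @{ $postings{$idx} || [] };
    }
    return [ sort { $a <=> $b } keys %cand ];
}

my $cand = candidates([1, 9]);   # only docs 1 and 3 contain index 1
```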

Hope this helps.
