Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

there is an idea which bounces around the data-mining community that goes like this: the most important words are repeated only a few times in a document. take an article that talks about something specific, like unit testing in perl and see how many times the words "unit", "test" and "perl" appear in the text as opposed to the words "code", "module", "method", "error", etc. so this sorta-kinda-maybe justifys the following simplification:don't count the number af times a word appears in a document, just consider if it's there or not. so now we can use a bit vector as opposed to a "real" vector, sparse bit vectors have a wonderfully small representation: a list of all the indexes at 1. while i immagine you're set of known words will be quite big odds are the set of words in each document will be small, so this reppresentation will save you a lot of space.

anyway, you decide what the best in memory reppresentation is for your data, then i suggest you use that. this is a very specific application and one which, imho, doesn't map well onto the entity relation view rdbs have of the world. if your application can handle looong start-up times (most can) you could just use Data::Dumper to accasionaly "snapshot" the data, or maybe a prevalance-esque solution would be a good idea.

the question is: what do you want to know about that vector? you're not really interested in the vector position in some n-dimensional space, but all you care about is it's distance from other vectors in this space. well, how do you calculate that distance? odds are you'll do what the article says and take the cosine distance, so what info do we need about a document in order to take the cosine distance? we need the magnitude of both vectors and we need the inner product. the magnitude we can pre-calculate (the square root of the sum of the elements squared, we can avoid squaring since all our elements are either one or zero, we can't avoid square rooting but we only have to do this when we insert a new document, so that's not such a bg deal). how does the inner product work? it's sum_{i=0}^{n} a_i * b_i, which in our case (we're using sparse vectors here) is just the size of the intersection of the indexes in a and b, but since those volues were inserted in order (got that: keep the list of indexes ordered) this is a quick scan through the list of indexes in a or b (use whichever is shorter).

now the last thing we'd like to touch on is the fact that for every query we have to check it against each and every other document. you've got solutions to this: 1) query caching (i'm not going to go into this any further) 2) vector space partitioning. odds are you're going to have your documents distrubited not unifromly but into various sets of similar documents (clusters). clustering algocithms are still being researched so there's no real "answer" but the idea is this: every now and then (vague definition, i know) you go through the db and partiton it into documents which aren't orthogonal between themselves (ie sets of documents with the same words in them). then when you get a query you can limit your search to only the set of documents which must have a cos distance > 0. sorry if this last bit ins't as clear as i'd like.

hope this helps.


In reply to Re: Refining a 'vector space search'. by Anonymous Monk
in thread Refining a 'vector space search'. by Seumas

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others browsing the Monastery: (5)
As of 2024-04-19 03:06 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found