Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Re: Refining a 'vector space search'.

by dws (Chancellor)
on Jun 28, 2003 at 16:56 UTC ( [id://269903]=note: print w/replies, xml ) Need Help??


in reply to Refining a 'vector space search'.

And the more unique words you have in all your combined documents, the more places/values you have. In one test, I calculated that there would currently be 22,000+ values in each vector/array (most are zeroes).

Was that count before or after stemming? When I played with vector space searching, stemming reduced the number of "words" significantly. But yes, you do end up with big vectors.

Replies are listed 'Best First'.
Re: Re: Refining a 'vector space search'.
by Seumas (Curate) on Jun 28, 2003 at 17:09 UTC
    This was after stemming, but it was using the default set of words. Either way, with so many documents (and all of them being written by a different person -- these are auction descriptions) the variety and number of truly unique and non-stopped words is going to be really high.

    Can you (or anyone) confirm for me that my understanding is correct, in that you need the entire vector for each document to compare against your search query's vector when you cosine them? That is, you couldn't perform some function ahead of time on your vector to reduce the size of it (yet still have the same representation) and have it be compared properly to the search query vectors? Since the two are so interrelated, I don't know that there could be a way to do that since each search is unique and performing some shrinking computation on it ahead of time would destroy the ability to gather results against each different set of query words...

    Like I say, I was just sort of hoping against hope here. I need a viable solution for searching on my site because I have hundreds of thousands of old (closed) auctions, thousands of existing (open) ones and then hundreds of thousands of posts in the forums. So as you can see, simple SQL queries are really not cutting it any longer - both in speed and accuracy (not to mention flexibility). I'm not even sure that I could properly use a reverse index method on this type of setup. Aaaargh.

      You can get away with a sparse representation of the vector, i.e. you can keep just the positions of the elements of the vector that are non-zero. Say the vector looks like:

      (0, 1, 0, 0, 0, 1, 0, 0, 1, 0)
      than you can encode it by a list containing just:
      ( 1, 5, 8 )
      This can be a huge space saver if the vector is really large. It is very easy to calculate the cosine between vectors in that representation: just count the number of common elements in the lists and divide by the square root of product of the length of each of the two lists.

      Hope this helps, -gjb-

        Is it enough to represent the values in 0/non-zero like that? That is, do we just need to know that "places 4, 11, 145, 519, 1238 are all non-zero"? I mean, in the vector I displayed as the example above, you see that they are broken down into what appears to be cos themselves(?) and everything is either 0 or a point between 0 and 1. So when performing a calculation on your above example, would "places 1,5 and 8 are positive" be just as fruitful as "place 1 is 0.02453, place 5 is 0.42128 and place 8 is 0.242112"?

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://269903]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others meditating upon the Monastery: (7)
As of 2024-04-23 09:08 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found