http://qs321.pair.com?node_id=269898

Seumas has asked for the wisdom of the Perl Monks concerning the following question:

In an earlier thread (Fuzzy matching to user's tastes?), Tilly pointed me to an article on Building a Vector Space Search Engine in Perl. It seems that if I could implement this, it would solve a number of my problems. There are a couple of downsides, however, and I'm hoping that fellow monks can offer solutions to them. I have the mathematical knowledge to grasp the concept, but not enough to expand on it in the way I believe I need.

For reference, here is the source code I'm working with: http://www.perl.com/2003/02/19/examples/VectorSpace.pm.

In short, a set of documents is parsed for unique words, stop-words are removed, and the frequency of each word in each document is recorded; from this we build an index of document vectors to evaluate against our search keywords (which are themselves turned into a vector before comparing). The closer two vectors are together, the more likely the documents are related (or at least of interest).
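
To make sure I have the mechanics straight, here is a rough sketch of how I understand it (my own code, not the article's; the make_vector helper and the @all_words list are just stand-ins):

use PDL;

# @all_words is assumed to be the sorted list of every unique,
# non-stop word across the whole document set.
sub make_vector {
    my ($text, $all_words) = @_;
    my %count;
    $count{lc $_}++ for split /\W+/, $text;
    # one slot per indexed word; most slots stay zero
    return pdl(map { $count{$_} || 0 } @$all_words);
}

# Cosine measure: normalise both vectors to unit length, then take
# the inner (dot) product. The closer to 1, the closer the match.
sub cosine {
    my ($v1, $v2) = @_;
    return inner(norm($v1), norm($v2))->sclr;
}

A search query is then just treated as one more tiny document: build its vector the same way and take the cosine against every document vector.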


That's all well and good, but I'm running into a snag because all of this (as written) happens in memory. So every time the script runs, it has to parse, analyze and index every document all over again. The documents are stored in PostgreSQL; there are approximately 5,000 of them at any one time (currently), with a turnover of several hundred per day (meaning old documents expire and new documents are created).

So I want to perform as much of the parsing and calculation as I can ahead of time, perhaps a few times per day, and store the results in a database table with two columns: one holding the document identifier (so I know which document the row refers to) and the other holding the precomputed vector, score, or whatever cumulative value comes out of the calculation.
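
Something like this is what I have in mind for the storage side (purely a sketch; the doc_vectors table, its columns, and the choice of Storable for serialisation are all my own invention, and if the vectors stay as PDL piddles rather than plain Perl structures, PDL::IO::Storable would be needed on top of this):

use DBI;
use DBD::Pg qw(:pg_types);
use Storable qw(nfreeze thaw);

# Assumes a table created as:
#   CREATE TABLE doc_vectors (doc_id integer PRIMARY KEY, vector bytea);
my $dbh = DBI->connect('dbi:Pg:dbname=mydb', 'user', 'secret',
                       { RaiseError => 1, AutoCommit => 1 });

sub store_vector {
    my ($doc_id, $vector) = @_;    # $vector: a plain Perl ref, e.g. a hashref
    my $sth = $dbh->prepare(
        'INSERT INTO doc_vectors (doc_id, vector) VALUES (?, ?)');
    $sth->bind_param(1, $doc_id);
    # nfreeze gives a portable byte string to stuff into the bytea column
    $sth->bind_param(2, nfreeze($vector), { pg_type => PG_BYTEA });
    $sth->execute;
}

sub load_vector {
    my ($doc_id) = @_;
    my ($blob) = $dbh->selectrow_array(
        'SELECT vector FROM doc_vectors WHERE doc_id = ?', undef, $doc_id);
    return thaw($blob);
}

A batch job run a few times daily would then insert rows for new documents and delete the expired ones.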

However, if I understand all of this correctly, each document's result is not a single value but an array of values giving the frequency of each indexed word (each position in the array represents one word, which either does or doesn't appear in the document). PDL's piddle() packs this into a more compact structure, but I'm not sure whether it boils down to a single string I could store (I've gotten inconsistent results trying to inspect it with Data::Dumper).

So, the vector returned for each document looks something like this:

[0 0.12031612 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.024063225 0 0 0 0.024063225 0 0 0.024063225 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.024063225 0 0 0 0 0 0 0 0 0 0 0 0 0 0.024063225 0 0 0 0]

And the more unique words you have across all your combined documents, the more positions each vector has. In one test, I calculated that each vector would currently need 22,000+ values (most of them zeroes). Multiply 22,000 values by 5,000 documents and you get 110 million values, which even at a single byte apiece is over 100MB. That's a lot to process every single time a user performs a search, and with thousands of users searching each day, this would just kill performance across the board.
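
Since almost everything in those vectors is zero, the one idea I have had is to store only the nonzero entries, keyed by word, and compare those directly (again my own sketch, not anything from the article):

# Sparse form: keep only the nonzero entries, so a 22,000-slot vector
# with a dozen nonzero counts shrinks to a dozen hash entries.
sub sparse_cosine {
    my ($x, $y) = @_;    # hashrefs of word => weight
    my ($dot, $nx, $ny) = (0, 0, 0);
    $nx += $_ ** 2 for values %$x;
    $ny += $_ ** 2 for values %$y;
    # only words present in BOTH documents contribute to the dot product
    for my $w (keys %$x) {
        $dot += $x->{$w} * $y->{$w} if exists $y->{$w};
    }
    return 0 unless $nx && $ny;
    return $dot / (sqrt($nx) * sqrt($ny));
}

That cuts the storage way down, but a search would still have to pull every document's hash out of the database, so it does not get me to the single-comparable-value ideal described next.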

So, what I'm hoping is that someone out here can offer me (in a way a stupid person could comprehend) a way to represent the same information far more compactly. Instead of a table with a six-digit document identifier in one column and a 22,000-value vector in the other, it would be great to have the six-digit identifier paired with a ten- or twelve-digit value that I could still compare vectorized search queries against without additional processing.

Is this a pipe dream? Am I even explaining myself clearly? *grin*

Thanks for your time.