
As a naive approach, why not collapse the unique words that are highly correlated into statistical synonyms?

If a group of unique words tends to occur together in the same documents, you could choose one of them as the representative in your vector and store the others as synonyms for that representative. (And yes, that would make your search a two-step process.)
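A rough sketch of that idea in Python, just to make it concrete. Everything here is an assumption for illustration: the names `pearson`, `build_synonyms` and `normalize_query`, the 0.8 threshold, the use of Pearson correlation on word presence per document, and picking the most widespread word as the representative. It is not a tuned method.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a word present in all or no documents carries no signal
    return cov / sqrt(vx * vy)

def build_synonyms(docs, threshold=0.8):
    """Map every word to a representative word.

    docs is a list of sets of words. Words whose presence/absence
    across documents correlates at or above `threshold` (an arbitrary
    cutoff) end up in the same group.
    """
    vocab = sorted({w for d in docs for w in d})
    presence = {w: [1 if w in d else 0 for d in docs] for w in vocab}
    parent = {w: w for w in vocab}

    def find(w):
        # Union-find lookup with path halving.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for a, b in combinations(vocab, 2):
        if pearson(presence[a], presence[b]) >= threshold:
            parent[find(b)] = find(a)

    groups = {}
    for w in vocab:
        groups.setdefault(find(w), []).append(w)

    rep_of = {}
    for members in groups.values():
        # Representative: the word found in the most documents,
        # ties broken alphabetically.
        rep = min(members, key=lambda w: (-sum(presence[w]), w))
        for w in members:
            rep_of[w] = rep
    return rep_of

def normalize_query(words, rep_of):
    """Step one of the two-step search: swap synonyms for representatives."""
    return [rep_of.get(w, w) for w in words]

docs = [
    {"car", "automobile", "engine"},
    {"car", "automobile", "road"},
    {"perl", "regex"},
    {"perl", "regex", "engine"},
    {"road", "map"},
]
rep_of = build_synonyms(docs)
query = normalize_query(["regex", "car", "unknown"], rep_of)
```

Step two is then the usual vector comparison, using only the representative words as dimensions. Note that comparing every pair of words is quadratic in the vocabulary size, so a real corpus would need something smarter than this.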

As for what value to assign to a representative word, there's no theoretically best way, but summing the frequencies of all the correlated words is definitely not the way, as it will grossly bias the vector.

An unsophisticated way is to take the simple average of the frequencies. A "better" way would be to use factor analysis, or any other statistical dimension reduction technique, to work out the weights for the different words empirically and then compute a weighted average from them.
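To see the difference between the options, here is a small Python sketch. The function name `collapse`, the word groups, the frequencies and the weights are all made up for illustration. In particular, the weights are typed in by hand here, whereas in practice they would come out of something like a factor analysis.

```python
def collapse(freqs, groups, weights=None):
    """Collapse word frequencies onto representative words.

    freqs:   word -> frequency in one document
    groups:  representative -> list of member words (representative included)
    weights: optional word -> weight; with no weights every member counts
             equally, which gives the simple average
    """
    vector = {}
    for rep, members in groups.items():
        w = [1.0 if weights is None else weights[m] for m in members]
        weighted_sum = sum(wi * freqs.get(m, 0) for wi, m in zip(w, members))
        vector[rep] = weighted_sum / sum(w)
    return vector

# Hypothetical synonym groups and one document's word counts.
groups = {"car": ["car", "automobile", "auto"], "perl": ["perl"]}
freqs = {"car": 4, "automobile": 2, "auto": 0, "perl": 3}

# Plain sum: the collapsed dimension is inflated compared to "perl".
summed = {rep: sum(freqs.get(m, 0) for m in ms) for rep, ms in groups.items()}

# Simple average: stays on the same scale as a single word's frequency.
averaged = collapse(freqs, groups)

# Weighted average with hand-picked (assumed) weights.
weights = {"car": 0.5, "automobile": 0.3, "auto": 0.2, "perl": 1.0}
weighted = collapse(freqs, groups, weights)
```

With these numbers the sum gives "car" a value of 6 against 3 for "perl", while the simple average gives 2, so the collapsed group no longer outweighs the words that were left alone.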

In reply to Re: Refining a 'vector space search'. by chunlou
in thread Refining a 'vector space search'. by Seumas
