Re: Refining a 'vector space search'.

It's been a few years since I've done math, but checking definitions at wolfram.com convincecs me the following is correct. Just don't use the information in any life-or-death situation.

The reason you have a value with floating point values, above, is because the vector has been normalized. But if you only care about existence, not frequency, both storage and procecssing speed can be drasticallyimproved.

A vector, V=(v1, v2, v3, ...., vn) can also be described as a magnitude and a set of angles. The magnitude, or length, is called the norm, denoted '|V|'. While there are severral norms, the comon one is the two-norm: in two dimensions, that would be Pythagorus' Theorem

h^2 = a^2 + b^2.

or for a longer vector,

|V| = sqrt( v1^2 + v2^2 + ... vn^2 ).

The dot product or inner product of two vectors, V & W, is calculated as :

v1 * w1 + v2 * w2 + v3 * w3 + ... + vn * wn
[download]

By definition,

V * W = |V| * |W| * cos( theta )

where theta is the angle between the vectors. Dividing both sides by the magnitudes of the vectors,

  V * W
---------   = cos ( theta )
|V| * |W|

or

VV = V / |V|; WW = W / |W|

  VV * WW     VV * WW 
----------- = ------- = cos( theta )
|VV| * |WW|    1 * 1
[download]

That is, if you normalize each vector beforehand, by dividing each v1, v2, v3, ... by the total length of the vector, the norm of the vector is one, thus the denominator, |VV| * |WW| is simply 1*1. Performing the normalization beforehand saves two norms, a multiplication and a division, but requires storing huge arrays of floating point numbers.

However, since we only care whether words are present, not their frequency, and since only a limited number of words appears in any one document, everything can be made a whole lot simpler!

A vector, V, contains a set of ones and zeros, indicating whether a word is present: [ 0, 1, 0, 0, 0, 1, 0, 0, 0, 1 .... ]. When we calculate V * W = vi and wi are either both 1, in which case that term is 1, or else the term is 0. Thus, instead of multipliying the vectors, we can represent the vectors using bitmaps and Boolean AND the two bitmaps together to determine which terms are set. Then add all the set terms to get the dot product. Since the vectors are sparse, just keep track of the indices of the set bits. Go through the two vectors in parallel, to detect mutually set bits. Since the arrays are ripped apart, we need to copy them. It's probably faster to use a C-style for loop, but I had an allergic reaction when I considered doing that, so I left it as an exercise for the reader. Allow zero length vectors, to represent empty documents.

$v = [ 1, 4, 7, 10, 13,  16, 19, 22 ];
$w = [ 1, 2, 4, 8, 16 ];

sub dot_prod {
    # report error unless args exist and are references to arrays
    return unless (  (2 = @_ )                   &&
                          (ref $_[0] eq "ARRAY") && 
                          (ref $_[1] eq "ARRAY") );

    # if either array is empty,  the dot product is zero.
    return 0 unless ( @{$_[0]} && @{$_[1]} );

    my @v = @{$_[0]};
    my @w = @{$_[1]};
    my $result = 0;
    my ( $v, $w ) = ( shift @v, shift @w );
    
    do { 
         if ( $v == $w ) {
            $v = shift @v;
            $w = shift @w;
            $result++;
         } elsif ( $v > $w ) {
             $w = shift @w;
         } elsif ( $v < $w ) {
          $v = shift @v;
         }
    } while ( @v && @w );

    return $result;
}

sub norm {
   return 0 unless ( @_ && ( ref $_[0] eq "ARRAY ) )
   return sqrt( @{$_[0]} );
} 

my $dot_product = dot_prod( $v, $w ) or die( "dot_prod error" );;
my $denom = (norm( $v ) * norm( $w );
my $cos_theta = $denom ? $dot_product / $denom : 0;
[download]

Time for another optimization: instead of calcultating the norm over and over, make the norm the first entry in each array, and skip over it in the dot product routine.

So your database will contain one table which is an array of words, with an index for each. As you add new documents, with new words, the table grows larges, but I doubt that the number of distinct words will ever exceed the size of a 16 bit integer, let alone 32 bits. Then for each document, you just have a list of which words were present. I would guess your auction description documents would be 1,000 words, let's call it 10,000 worst case. But half of those are prepositions and articles and other stop-words. The other words will be repeated a nuumber of times, but still, you'll wind up with a list of 200 to 2000 significant terms. Ten years from now, your word table may have grown large, if place names and historic periods and craftsmen are named, but each document will remain 200 to 2000 32-bit integers.

--
TTTATCGGTCGTTATATAGATGTTTGCA

Comment on Re: Refining a 'vector space search'. Select or Download Code


go ahead... be a heretic
	PerlMonks