Re: Netflix (or on handling large amounts of data efficiently in perl)

Nice problem

I have some suggestions for optimizing the data representation that I think are feasible for this problem.

From your post it seems that three values are critical: movie_id, user_id and rating.
Without user_id, the <movie_id><rating> pairs coming from the users are not unique; they form a repetitive pattern.

For example:
<movie_id><user_id><rating>
<1><U1><2>
<1><U2><3>
<1><U3><2>
<1><U4><3>

In this sample data the <movie_id><rating> combination repeats, so each movie can be given a map of its 5 possible ratings, and the mapped value can be stored instead of a movie_id and a rating each time.

<movie_id><rating>
<1><1> => a
<1><2> => b
<1><3> => c
<1><4> => d
<1><5> => e

The stored combination is then only the user_id and a value from the above map.

For example:
<U1><b1>
<U2><c1>
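
A minimal sketch of this mapping, in case it helps. It assumes numeric user ids and integer codes instead of the letters above, and every name in it (encode_pair, decode_pair, the sample rows) is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample rows: [movie_id, user_id, rating].
my @rows = ([1, 101, 2], [1, 102, 3], [1, 103, 2], [1, 104, 3]);

my %code_for;    # "movie_id:rating" => code
my @pair_for;    # code => [movie_id, rating]

# Hand out the next free integer the first time a (movie, rating) pair is seen.
sub encode_pair {
    my ($movie_id, $rating) = @_;
    my $key = join ':', $movie_id, $rating;
    unless (exists $code_for{$key}) {
        push @pair_for, [$movie_id, $rating];
        $code_for{$key} = $#pair_for;
    }
    return $code_for{$key};
}

# Reverse lookup: code back to (movie_id, rating).
sub decode_pair {
    my ($code) = @_;
    return @{ $pair_for[$code] };
}

# Each stored record is a user_id plus one code, packed as two
# 32-bit big-endian integers (the widths are an assumption).
my @records = map { pack 'N N', $_->[1], encode_pair($_->[0], $_->[2]) } @rows;

for my $record (@records) {
    my ($user_id, $code) = unpack 'N N', $record;
    my ($movie_id, $rating) = decode_pair($code);
    print "user $user_id rated movie $movie_id: $rating (code $code)\n";
}
```

The four sample rows need only two distinct codes, which is the repetition the map is meant to exploit.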

Though this adds an extra lookup on retrieval, the stored data is compressed by mapping it to the new values.
The same logic can also be extended to a second level of mapping that covers users with a specific rating pattern:

<userid><rating>
<U1><1> => a1
<U1><2> => a2

These values can then be used along with the movie_id.

For the lookup implementation, a simple Berkeley DB would be the easiest option in terms of both implementation and retrieval.
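
A rough sketch of a lookup kept on disk as a tied hash. Note that it uses SDBM_File, which ships with Perl, as a stand-in so that it runs anywhere; for Berkeley DB itself you would tie to DB_File instead, whose tie call takes one extra argument (e.g. $DB_HASH). The file name and the key layout are assumptions for illustration:

```perl
use strict;
use warnings;
use Fcntl;                        # O_RDWR, O_CREAT
use SDBM_File;                    # stand-in for DB_File (Berkeley DB)
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/ratings";        # hypothetical file name

tie(my %db, 'SDBM_File', $file, O_RDWR | O_CREAT, 0666)
    or die "Cannot tie $file: $!";

# Forward map for movie 1: "movie_id:rating" => code, letters as in the example.
my @letters = ('a' .. 'e');
$db{"1:$_"} = $letters[$_ - 1] for 1 .. 5;

# Compressed records: user => code.
$db{'user:U1'} = $db{'1:2'};      # U1 rated movie 1 with 2
$db{'user:U2'} = $db{'1:3'};      # U2 rated movie 1 with 3

# Read the values back through the tie before closing it.
my ($u1_code, $u2_code) = @db{'user:U1', 'user:U2'};
print "U1 => $u1_code, U2 => $u2_code\n";

untie %db;
```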

An alternative you might think of is concatenating the attribute values and storing them, but that is not going to do any good in terms of either retrieval or storage.

Please feel free to tell me I am wrong if I really am wrong. :)

Re^2: Netflix (or on handling large amounts of data efficiently in perl) by mirod (Canon) on Dec 24, 2008 at 14:10 UTC
It really depends on what processing needs to be done on the data. You are trading space for speed. Here, if you need to get all the data for a movie, you will need to go through all of the users. So before choosing a format to store the data, it would be useful to know what you want to do with it.
Re^3: Netflix (or on handling large amounts of data efficiently in perl) by matrixmadhan (Beadle) on Dec 25, 2008 at 05:05 UTC
I completely agree that space is being traded for speed. But from the OP it seems that storage is much more important than retrieval, so I think my approach might well prove to be a good fit. As an improvement on storing them in a map, the codes could be auto-generated, so that with index-based retrieval even contiguous storage can be used, without having to create lookup maps and retrieve from them.
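
One way to read the auto-generator idea, as a sketch only: derive the code arithmetically from movie_id and rating, and append fixed-width records to one contiguous string, so a record is found by its offset. The record layout, the sample rows and all sub names below are assumptions made for the example:

```perl
use strict;
use warnings;

use constant RATINGS   => 5;    # ratings run from 1 to 5
use constant REC_BYTES => 8;    # one record: two 32-bit integers (assumed layout)

# Derive the code from the values themselves, so no lookup table is stored.
sub pair_to_code {
    my ($movie_id, $rating) = @_;
    my $code = ($movie_id - 1) * RATINGS + ($rating - 1);
    return $code;
}

# Invert the arithmetic: code back to (movie_id, rating).
sub code_to_pair {
    my ($code) = @_;
    return (int($code / RATINGS) + 1, $code % RATINGS + 1);
}

# Append fixed-width records to one contiguous string.
my $store = '';
for my $row ([1, 101, 2], [1, 102, 3], [7, 101, 5]) {    # hypothetical [movie, user, rating]
    my ($movie_id, $user_id, $rating) = @$row;
    $store .= pack 'N N', $user_id, pair_to_code($movie_id, $rating);
}

# Fetch record number $n (0-based) by offset alone.
sub record_at {
    my ($n) = @_;
    my ($user_id, $code) = unpack 'N N', substr($store, $n * REC_BYTES, REC_BYTES);
    return ($user_id, code_to_pair($code));
}

my ($user_id, $movie_id, $rating) = record_at(2);
print "user $user_id rated movie $movie_id: $rating\n";
```

As mirod points out, this only helps if the access pattern matches the storage order; finding all ratings for one movie still means scanning the whole string.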

