http://qs321.pair.com?node_id=732407

Garp has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I'm fairly much a n00b when it comes to Perl, but as I find myself with a fair bit of spare time on my hands I'm wanting to learn to use it a bit more productively. I'm trying to go from just 'script hacking' stuff that aids sysadmin work to something a little more effective.

This is probably a classic case of running before I can walk, but I thought I'd look at the netflix challenge, not out of any naive belief that I'd be successful but because it gives me specific goals and stuff to get my teeth into. That and something about it is appealing to my inner geek.

So I hit the first hurdle, and sort of fell flat on my face. Specifically it's the sheer weight of data. I successfully imported that into MySQL and did some basic data mining there, but I started pondering about more efficient ways to handle the data, and whether it would be possible to speed everything up by storing the data in memory somehow for efficient access. MySQL is handy and all, but seems a pointless overhead for what is essentially static data.

For those unfamiliar with the challenge, the data set contains over 100 million movie ratings (all integers from 1 to 5), made by 480,189 users on 17,770 films. User ids aren't contiguous, just to be helpful, and they can be anywhere between 1 and 2649429. I'm half inclined to map the Netflix user ids to new sequential ones, but I'm not sure there would be any benefit from that as each id would still be a long integer IIRC.

Data comes as a set of text files, one per movie, containing userid, rating and date. From what I've gathered, Netflix's own engine doesn't pay attention to the date of the vote, nor do those of the top teams; the correlation between the date of the vote and a user liking or disliking a film is negligible (hardly surprising). So that leaves me with movie_id, user_id and rating. 480,189 x 17,770 = 8,532,958,530 data points, at a size of 1 byte each I assume, is about 8.5 GB of data! The reality, naturally, is that not every user has rated every film, hence the 100m figure above. If my math is remotely right that's only about 95 MB of actual rating data, but then each vote will need to be paired to some form of index.

The original data in plain text form takes up about 2 GB of disk space. Loading that into memory is just asking for trouble, and impossible given the constraints of the machine I've got access to (a Windows Vista laptop with 2 GB of RAM), let alone the data being inefficiently indexed.

One user in the challenge, called icefox, has produced an open source framework in C++ that manages the feat quite effectively (google "icefox netflix framework"). Data is stored in two files, one containing every vote and user sorted by movie id, and a second file filled with references to the starting location for each movie in the first file. This is then mmap'd into memory for use.
That's the route I'm currently trying to see if I can implement under Perl.
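
As a rough illustration of that two-file layout in plain Perl (the file names, the 32-bit packing and the trailing sentinel offset are assumptions, not necessarily icefox's actual format), reading one movie's votes back might look like this:

use strict;
use warnings;

# votes.bin   - one packed 32-bit value per vote, grouped by movie id:
#               (user_id << 3) | rating
# offsets.bin - one 32-bit record number per movie, giving where that
#               movie's votes start in votes.bin, plus a final sentinel
#               entry holding the total number of votes.
sub votes_for_movie {
    my ($movie_index) = @_;          # 0-based movie index

    open my $off, '<:raw', 'offsets.bin' or die "offsets.bin: $!";
    seek $off, $movie_index * 4, 0;
    read $off, my $buf, 8 or die "short read on offsets.bin";
    my ($start, $end) = unpack 'L2', $buf;

    open my $dat, '<:raw', 'votes.bin' or die "votes.bin: $!";
    seek $dat, $start * 4, 0;
    read $dat, my $raw, ($end - $start) * 4;

    # Return a list of [user_id, rating] pairs for this movie.
    return map { [ $_ >> 3, $_ & 0x7 ] } unpack 'L*', $raw;
}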

Does anyone have any tips, suggested reading or the like on how to start learning to handle such large amounts of data? If the above approach is a poor idea, I'd welcome any critique or suggestions at all.
I've seen references to PDL being useful. If I'm understanding the documentation right it's effectively doing an mmap under the surface?
If I'm able to form a reasonable framework in Perl that is efficient in both memory and speed, my intention is to host it open source on my website for other Perl folks to use for the challenge, or whatever other weird projects they have in mind!


Re: Netflix (or on handling large amounts of data efficiently in perl)
by Tanktalus (Canon) on Dec 24, 2008 at 14:43 UTC
    sheer weight of data

    That, right there, screams, "database." This volume of data ("overwhelming") is exactly what you use a database for. Don't worry about the speed - you can tweak the parameters to speed things up (a sweet spot for RAM usage where the db keeps stuff in memory, for example). Or you can move to a faster database. Or bigger hardware. But what you're not going to do is beat the speed by writing your own code in perl. Not because perl isn't fast, but because it'll take you forever to do (and then there's bugfixing).

    If you don't want the external dependency of MySQL, then try DBD::SQLite, though I suspect MySQL to be faster.

    By having a database system, complete with proper indexing, you can shunt most of the heavy lifting off to the C/C++ code instead, including its native handling of strings, etc., with far less overhead than Perl. It'll do precisely what the C++ code you refer to does: have an index which it searches, and then uses the offsets there to find the data in the data file(s). And it'll use mmap, if that's what is appropriate. And you don't have to write any code - just call the API with the query you want. This will allow you to focus on the real problem you're trying to tackle rather than computer-science details about how to support the problem.
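
    As a sketch of that route with DBI and DBD::SQLite (the file name, schema and sample row are purely illustrative):

    use strict;
    use warnings;
    use DBI;

    # One flat votes table plus an index on movie_id; the index plays the
    # same role as the hand-rolled offsets file in the C++ framework.
    my $dbh = DBI->connect('dbi:SQLite:dbname=netflix.db', '', '',
                           { RaiseError => 1, AutoCommit => 1 });

    $dbh->do('CREATE TABLE IF NOT EXISTS votes (
                movie_id INTEGER, user_id INTEGER, rating INTEGER)');
    $dbh->do('CREATE INDEX IF NOT EXISTS votes_by_movie ON votes (movie_id)');

    # Wrap the bulk load in one transaction for speed.
    $dbh->begin_work;
    my $ins = $dbh->prepare('INSERT INTO votes VALUES (?, ?, ?)');
    $ins->execute(1, 12345, 3);     # illustrative row, not real data
    $dbh->commit;

    my ($avg) = $dbh->selectrow_array(
        'SELECT AVG(rating) FROM votes WHERE movie_id = ?', undef, 1);
    print "movie 1 average rating: $avg\n";
    $dbh->disconnect;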

Re: Netflix (or on handling large amounts of data efficiently in perl)
by tilly (Archbishop) on Dec 24, 2008 at 04:35 UTC
    mmap is being used as a cheap way to sidestep I/O. That doesn't make as much sense in Perl. A more natural solution in Perl is to use a dbm like Berkeley DB. Going in another direction, you could probably store your information in under 1 GB by using vec to store a vector of 32-bit numbers, each of which uses 3 bits for the rating and the rest for the user ID.

    Personally I'd be inclined to use vec. (Actually I'd be inclined to use another language than Perl...)
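
    A minimal sketch of that vec packing--29 bits of user id plus 3 bits of rating per 32-bit cell; per-movie start offsets would still be needed on top of this:

    use strict;
    use warnings;

    my $votes = '';   # one long string treated as an array of 32-bit cells
    my $n     = 0;    # votes stored so far

    sub add_vote {
        my ($user_id, $rating) = @_;
        die "user id too large for 29 bits" if $user_id >= 2**29;
        vec($votes, $n++, 32) = ($user_id << 3) | $rating;
    }

    sub get_vote {
        my ($i) = @_;
        my $cell = vec($votes, $i, 32);
        return ($cell >> 3, $cell & 0x7);   # (user_id, rating)
    }

    add_vote(2649429, 5);                   # highest user id mentioned above
    my ($uid, $rating) = get_vote(0);
    print "user $uid rated $rating\n";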

        Let's see if I'm following your reasoning correctly.

        I'm essentially interested in three variables:
        $movieid
        $userid
        $rating

        Are you suggesting that I make a multi-dimensional Judy array of arrays (something like the sketch further down)? So for each movie create a Judy array using $userid as the index and $rating as the value, then put that into a Judy array as the value with $movieid as the index?

        Apologies if I'm stating the obvious, I wouldn't classify myself as a programmer.

        From a very, very rough test (I haven't even gone back to confirm the availability of the data) this is looking very good indeed for memory consumption. I will do some further testing tomorrow.
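
        For anyone following along, the two-level index being described has roughly this shape; plain Perl hashes stand in for the Judy arrays here, which shows the structure but not Judy's memory savings:

        use strict;
        use warnings;

        my %ratings;   # outer index: movie id -> { user id -> rating }

        sub add_vote {
            my ($movie_id, $user_id, $rating) = @_;
            $ratings{$movie_id}{$user_id} = $rating;
        }

        add_vote(1, 12345, 3);   # illustrative ids, not real data

        # All votes for one movie:
        my $movie = $ratings{1};
        for my $user (keys %$movie) {
            print "user $user rated movie 1: $movie->{$user}\n";
        }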

      Thanks for your suggestions. I'm currently trying to push the data out into a BerkeleyDB, having spent a few hours this morning trying to get an understanding of bdb usage. I gave up trying to understand MLDBM & bdb for now; the documentation on CPAN just got a bit weird. I found great resources using DB_File, but sadly ActivePerl hasn't managed to get that into their repository so far (I really miss having a *nix box around when it comes to this stuff!)

      Vec? Urgh, more time wading through perldoc ahead. It's a great technical resource, but half of it can be a pain for anyone not from a comp-sci or C++ programming background!

        Random tip. Try http://strawberryperl.com/ and see if it lessens the pain of Windows.

        A more technical tip: try sorting your data and using a btree format for it. With a hash you do a lot of seeking to disk, and seeks to disk are slow. 1/200th of a second per seek may not sound like a lot, but try doing 100 million of them and it will take the better part of a week. But a btree loaded and accessed in close to sorted order does lots of streaming to/from disk, and that is quite fast. (And a merge sort streams data very well.)
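
        A minimal DB_File sketch of that (the file name and fixed-width key layout are illustrative; zero-padding the keys keeps lexical order the same as numeric order, so pre-sorted input streams to disk):

        use strict;
        use warnings;
        use DB_File;
        use Fcntl;

        # Tie a hash to an on-disk btree; keys are kept in sorted order.
        my %votes;
        tie %votes, 'DB_File', 'votes.db', O_RDWR|O_CREAT, 0644, $DB_BTREE
            or die "Cannot open votes.db: $!";

        # Assume input lines of "movie_id user_id rating", already sorted
        # by movie then user, so writes stream instead of seeking.
        while (my $line = <STDIN>) {
            my ($movie, $user, $rating) = split ' ', $line;
            $votes{ sprintf '%08d:%08d', $movie, $user } = $rating;
        }

        untie %votes;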

Re: Netflix (or on handling large amounts of data efficiently in perl)
by matrixmadhan (Beadle) on Dec 24, 2008 at 05:17 UTC
    Nice problem

    I have got some suggestions about optimizing the data representation that I think are feasible for this problem.

    movie_id, user_id and rating.
    From your post it seems that these three values are what matter. Without user_id, the <movie_id> and <rating> pairs from the users cannot be unique, and they form a repetitive pattern.

    For ex:
    <movie_id><user_id><rating>
    <1><U1><2>
    <1><U2><3>
    <1><U3><2>
    <1><U4><3>

    With the above sample data, movie_id and rating form a repeating pattern, so a map of the 5 possible values for each and every movie can be used instead of storing movie_id and rating each time.

    <movie_id><rating>
    <1><1> => a
    <1><2> => b
    <1><3> => c
    <1><4> => d
    <1><5> => e

    and the new stored combination would be only user_id plus the mapped value

    ex:
    <U1><b1>
    <U2><c1>

    Though it adds an additional lookup on retrieval, the actual storage of the data is compressed by mapping to the new values.
    The same logic can also be extended to a second level of mapping, to include "users with a specific rating pattern":

    <userid><rating>
    <U1><1> => a1
    <U1><2> => a2

    and the above values can be used along with the movie id.

    For the lookup implementation, a simple Berkeley DB would be easy to go with in terms of both implementation and retrieval.

    An alternative you might think of is appending attribute_values and storing them, but that's not going to do any good in terms of retrieval or storage.

    Please feel free to say I'm wrong if I really am wrong. :)
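
    In code, the mapping being described might look something like this (variable and sub names are mine, purely for illustration): each distinct <movie_id><rating> pair gets a small code, and each vote is then stored as <user_id><code>.

    use strict;
    use warnings;

    my %code_for;    # "movie_id:rating" -> small integer code
    my @pair_for;    # code -> [movie_id, rating]
    my @votes;       # stored votes as [user_id, code]

    sub encode_vote {
        my ($movie_id, $user_id, $rating) = @_;
        my $key  = "$movie_id:$rating";
        my $code = $code_for{$key} //= do {
            push @pair_for, [$movie_id, $rating];
            $#pair_for;                 # index of the pair just added
        };
        push @votes, [$user_id, $code];
    }

    sub decode_vote {
        my ($i) = @_;
        my ($user_id, $code)    = @{ $votes[$i] };
        my ($movie_id, $rating) = @{ $pair_for[$code] };
        return ($movie_id, $user_id, $rating);
    }

    encode_vote(1, 'U1', 2);    # the <1><U1><2> row from the example above
    encode_vote(1, 'U2', 3);
    printf "movie %s, user %s, rating %s\n", decode_vote(0);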

      It really depends what processing needs to be done on the data. You are trading speed for space: here, if you need to get all the data for a movie, you will need to go through all of the users. So before choosing a format to store the data, it might be useful to know what you want to do with it first.

        I perfectly agree that speed is being traded for space. But as per the OP it seems that storage matters more than retrieval, so I think my approach might prove to be a good fit.

        As an improvement over storing them as an explicit map, a generator function can be applied so that each code is derived from its index; then contiguous storage can be used without having to create lookup maps and retrieve from them at all.
Re: Netflix (or on handling large amounts of data efficiently in perl)
by BrowserUk (Patriarch) on Dec 25, 2008 at 02:48 UTC

    I had a play with the data back when the NetFlix challenge was first mentioned here.

    Simply stated, the challenge dataset requires a minimum of 17,770 * 480,189 cells * 3 bits (the smallest n where 2**n >= 5) / 8 = 2.98 GB of data.

    Which is greater than most (any?) 32-bit system can accommodate in memory.

    The alternatives--performing complex joins (directly or indirectly) using disk-based storage; I tried an RDBMS (pgsql), Berkeley, SQLite3, and custom memory-mapped files using Inline::C--all rendered the experimentation process so laborious that it would tie up my 2.6GHz single-cpu machine for anywhere from 33 (MMAP) or 40 (Berkeley) to 100+ (MySQL) hours.

    I abandoned my attempts because I concluded that life was too short to wait 2 to 4 days to discover whether the latest tweak to my algorithm was an improvement or not; and that without access to either a 64-bit machine with at least 8 GB of RAM, or a cluster of 4 x 32-bit machines and suitable operating software, the cycle time was simply too long to sustain my interest.

    The funny thing about the sample dataset is that there is (statistically) no reason for it to be so large. A much smaller dataset would have been equally valid as a means of evaluating algorithms. I tried producing a representative sub-sample of the supplied sample to speed up the development cycle, but sub-sampling samples is a notoriously hit & miss affair--tending as it does to emphasise any bias in the original sample. My attempts failed to produce a subsample that evaluated representatively of the supplied sample. It's almost as if the supplied sample was chosen to preclude serious participation by the lone developer.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Actually, if you look at icefox's framework it will do its basic average run in:

      real 0m1.348s
      user 0m0.100s
      sys 0m0.400s

      That's running on an Ubuntu server inside a VirtualBox VM, so just a single core of a T2060 (1.6 GHz Core 2 Duo).

      Some of the more complicated averages naturally take more time, but it's still in the realm of reasonable for home users.

        Sorry. I couldn't be bothered to sift through the 500+ links that googling "icefox netflix framework" threw up.

        If the code is available somewhere, why not post a direct link?


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Netflix (or on handling large amounts of data efficiently in perl)
by Anonymous Monk on Dec 24, 2008 at 15:45 UTC
    It is my understanding that SPEED is not the primary goal of the challenge, but a better algorithm. I wouldn't worry about speed/memory until my algorithm was developed.

      Agreed, speed isn't the primary concern of the challenge. Two reasons I'm looking specifically at this end of things though:

      1) Most of the existing teams are now down to blending results, requiring multiple runs of their algorithms before the final blend is performed. In the interests of keeping this process fairly sane, storing and retrieving the data efficiently seems logical given it will be accessed repeatedly.
      2) Trying to follow what I perceive to be some form of good practice. No point writing code based around one data access method and then discovering I've got to re-write half of it just because I've changed how the data is stored.

Re: Netflix (or on handling large amounts of data efficiently in perl)
by Withigo (Friar) on Jan 02, 2009 at 11:00 UTC
    Shortly after the contest started I ran into the exact same problem you've described. Very early on, while still in the back-of-the-napkin estimation phase, I realized that using any kind of database whatsoever was completely out of the question, since I don't have access to a supercomputer that could store the entire matrix in memory all at once. The data is not exactly relational--since it's just one giant, flat matrix, all you're basically doing is counting. But running any of the common statistical algorithms (lifted verbatim from "Numerical Recipes in C") requires an overwhelming number of multiplications and passes over the entire dataset, so any disk swapping (such as using mmap) would impose far too large a time cost. The matrix is also extremely sparse--something like 99% empty.

    So I dropped down to C via XS and serialized the data to disk as raw binary files using the Compressed Column Storage format, packing ints and doubles. I then put the binary files on Amazon's S3 and launched a few EC2 instances to handle separate chunks of the data files.
    I thought calculating Pearson's correlation coefficient for every movie against every other movie (ditto for every user against every other user) might lead to a good starting result, but completing this calculation would require something like 100 servers running for 70 years (or 1000 for 7 years).
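
    For anyone curious, a compressed-column layout for a ratings matrix looks roughly like the sketch below in plain Perl (the real thing was C via XS; the names and the exact on-disk header are illustrative): one array of user ids, a parallel array of ratings, and a column-pointer array recording where each movie's column starts.

    use strict;
    use warnings;

    my @col_ptr = (0);    # col_ptr[m] = index of movie m's first entry
    my (@row_idx, @val);  # row_idx[i] = user id, val[i] = rating

    # Votes are assumed to arrive already grouped by movie id.
    sub add_movie_column {
        my (@votes) = @_;                 # list of [user_id, rating]
        for my $v (@votes) {
            push @row_idx, $v->[0];
            push @val,     $v->[1];
        }
        push @col_ptr, scalar @row_idx;   # end marker doubles as next start
    }

    add_movie_column([7, 3], [42, 5]);    # movie 0 (illustrative values)
    add_movie_column([7, 1]);             # movie 1

    # Serialize as raw binary, roughly as described above.
    open my $fh, '>:raw', 'ratings.csc' or die $!;
    print {$fh} pack('L', scalar @col_ptr),
                pack('L*', @col_ptr),
                pack('L*', @row_idx),
                pack('C*', @val);
    close $fh;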

    Seeing as this was just an interesting side project to goof around with, I wasn't interested in racking up thousands of dollars in hosting expenses, so I gave up on this approach. It seems that most of the folks on the leaderboard have been using SVD, and I have no idea how they are actually computing that on desktops--maybe I'm missing something obvious. But all in all it was a fun learning experience--I had no idea it would end up being so complex along the way.