Re: Netflix (or on handling large amounts of data efficiently in perl)

by tilly (Archbishop)
on Dec 24, 2008 at 04:35 UTC [id://732410]


in reply to Netflix (or on handling large amounts of data efficiently in perl)

mmap is being used as a cheap way to sidestep I/O. That doesn't make as much sense in Perl. A more natural solution in Perl is to use a dbm such as Berkeley DB. Going in an alternate direction, you can probably store your information in under 1 GB by using vec to build a vector of 32-bit numbers, each of which uses 3 bits for the rating and the rest for the user ID.

Personally I'd be inclined to use vec. (Actually I'd be inclined to use another language than Perl...)
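
A minimal sketch of the packing tilly describes, with the low 3 bits of each 32-bit entry holding the rating and the remaining 29 bits holding the user ID (the exact bit layout, names, and sample values are illustrative, not code from the thread):

    use strict;
    use warnings;

    my %ratings_for_movie;   # movie ID => packed string of 32-bit entries
    my %count_for_movie;     # movie ID => number of entries stored so far

    sub add_rating {
        my ($movie, $user, $rating) = @_;
        die "user ID too large for 29 bits" if $user >= (1 << 29);
        $ratings_for_movie{$movie} = '' unless defined $ratings_for_movie{$movie};
        my $slot = $count_for_movie{$movie}++;
        # Low 3 bits: rating (1-5); high 29 bits: user ID.
        vec($ratings_for_movie{$movie}, $slot, 32) = ($user << 3) | $rating;
    }

    sub get_rating {
        my ($movie, $slot) = @_;
        my $packed = vec($ratings_for_movie{$movie}, $slot, 32);
        return ($packed >> 3, $packed & 0x7);   # (user ID, rating)
    }

    add_rating(42, 123456, 3);
    my ($user, $rating) = get_rating(42, 0);
    print "user $user rated movie 42 a $rating\n";

At 4 bytes per rating, a Netflix-sized set of roughly 100 million ratings comes to about 400 MB of vector data, which is how the whole thing can fit in under 1 GB.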

Replies are listed 'Best First'.
Re^2: Netflix (or on handling large amounts of data efficiently in perl)
by diotalevi (Canon) on Dec 27, 2008 at 02:21 UTC

      Let's see if I'm following your reasoning correctly.

      I'm essentially interested in three variables:
      $movieid
      $userid
      $rating

      Are you suggesting that I make a multi-dimensional Judy array of arrays? So for each movie create a Judy array using $userid as the index and $rating as the value, then put that into a Judy array as the value with $movieid as the index?

      Apologies if I'm stating the obvious; I wouldn't classify myself as a programmer.

      From a very, very rough test (I haven't even gone back to confirm the data is all retrievable) this is looking very good indeed for memory consumption. I will do some further testing tomorrow.

        Sure, why not. tilly originally mentioned a bitmap, so I mentioned something cheaper in memory. You can build multi-dimensional Judy arrays; in particular, JudyHS is implemented as a nested set of JudyL arrays. I've posted a snippet at Dump JudyHS which demonstrates dumping a JudyHS structure.

        For this to work Judy arrays must be nestable, and they explicitly are; the sketch below shows the resulting shape with plain hashes.

        ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊
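
        A plain-Perl sketch of the shape being discussed, using ordinary nested hashes purely to show the structure (movie ID => user ID => rating); it gives none of Judy's memory savings, for which each level would instead be a JudyL array built with the Judy module from CPAN. All names and values here are illustrative:

            use strict;
            use warnings;

            my %ratings;   # movie ID => { user ID => rating }

            sub add_rating {
                my ($movieid, $userid, $rating) = @_;
                $ratings{$movieid}{$userid} = $rating;
            }

            sub get_rating {
                my ($movieid, $userid) = @_;
                return $ratings{$movieid}{$userid};
            }

            add_rating(42, 123456, 4);
            print get_rating(42, 123456), "\n";   # prints 4

        In the Judy version the outer level is keyed by $movieid and each of its values points at an inner array keyed by $userid, which is the same nested-JudyL layout diotalevi says JudyHS uses internally.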

Re^2: Netflix (or on handling large amounts of data efficiently in perl)
by Garp (Acolyte) on Dec 24, 2008 at 22:50 UTC

    Thanks for your suggestions. I'm currently trying to push the data out into a BerkeleyDB file, having spent a few hours this morning trying to get an understanding of bdb usage. I gave up trying to understand MLDBM & bdb for now; the documentation on CPAN just got a bit weird. I found great resources on using DB_File, but sadly ActivePerl hasn't managed to get that into its repository so far (I really miss having a *nix box around when it comes to this stuff!).

    Vec? Urgh, more time wading through perldoc ahead. It's a great technical resource, but half of it can be a pain for anyone not from a comp-sci or C++ programming background!

      Random tip. Try http://strawberryperl.com/ and see if it lessens the pain of Windows.

      A more technical tip: try sorting your data and storing it in a btree format. With a hash you do a lot of seeking to disk, and disk seeks are slow. 1/200th of a second per seek may not sound like much, but do 100 million of them and that is roughly 500,000 seconds, the better part of a week. A btree loaded and accessed in close to sorted order, by contrast, does lots of streaming to and from disk, which is quite fast. (And a merge sort streams data very well.)
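
      A minimal sketch of that approach using the BerkeleyDB module's btree tie interface (DB_File with $DB_BTREE would look much the same); the file name, sample records, and packed key layout are illustrative only:

          use strict;
          use warnings;
          use BerkeleyDB;

          tie my %ratings, 'BerkeleyDB::Btree',
              -Filename => 'ratings.db',
              -Flags    => DB_CREATE
              or die "Cannot open ratings.db: $! $BerkeleyDB::Error";

          # Illustrative records: [ movie ID, user ID, rating ]
          my @records = ( [ 2, 123456, 4 ], [ 1, 7, 3 ], [ 1, 99, 5 ] );

          # Insert in sorted key order so the btree is written mostly
          # sequentially instead of seeking all over the disk.
          for my $r ( sort { $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] } @records ) {
              my ($movie, $user, $rating) = @$r;
              # pack 'NN' gives fixed-width big-endian keys, so the btree's
              # byte-wise ordering matches numeric (movie, user) order.
              $ratings{ pack 'NN', $movie, $user } = $rating;
          }

          print $ratings{ pack 'NN', 1, 99 }, "\n";   # prints 5

          untie %ratings;

      Reading the file back in the same sorted key order keeps the access pattern streaming rather than seeking, which is the point of the tip above.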
