http://qs321.pair.com?node_id=752247

punkish has asked for the wisdom of the Perl Monks concerning the following question:

Part 1


I have the following schema in a SQLite db that is 430 MB, on my 2.4 GHz Core 2 Duo MacBook laptop's 320 GB HFS+-formatted 7200 RPM disk with an 8 MB cache. The laptop has 4 GB RAM. Update: this will eventually reside on a quad-core 3 GHz Xeon Xserve with 32 GB RAM and a 1 TB 7200 RPM disk (I don't know its cache).

-- 1,000,000 rows
CREATE TABLE cell (
    cell_id INTEGER PRIMARY KEY,
    met_cell_id INTEGER
    -- 8 other INTEGER or REAL columns
)

-- 38 rows
CREATE TABLE lc (
    lc_id INTEGER PRIMARY KEY
    -- 56 other INTEGER or REAL columns
)

-- 10 rows
CREATE TABLE dist (
    dist_id INTEGER PRIMARY KEY
    -- 37 other INTEGER or REAL columns
)

-- 2,920,000 rows
CREATE TABLE met (
    met_id INTEGER PRIMARY KEY,
    met_cell_id INTEGER
    -- 9 other INTEGER or REAL columns
)

CREATE TABLE cell_lc (cell_id INTEGER, lc_id INTEGER)
CREATE TABLE cell_dist (cell_id INTEGER, dist_id INTEGER)

CREATE INDEX idx_met_cell_id ON met (met_cell_id)
CREATE INDEX idx_cell_lc ON cell_lc (cell_id)
CREATE INDEX idx_cell_dist ON cell_dist (cell_id)

I also have an R*Tree index, but that is a different story, not relevant here.

I retrieve *all* data for one cell (bound as ':cell_id') using the following queries:

-- [1] First retrieve all data from the cell table
SELECT * FROM cell WHERE cell_id = :cell_id

-- [2] Now retrieve the related lc
SELECT l.*
FROM lc l JOIN cell_lc c ON l.lc_id = c.lc_id
WHERE c.cell_id = :cell_id

-- [3] Retrieve the related dist
SELECT d.*
FROM dist d JOIN cell_dist c ON d.dist_id = c.dist_id
WHERE c.cell_id = :cell_id

-- [4] Retrieve the related met
SELECT * FROM met WHERE met_cell_id = <met_cell_id from query [1] above>
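For concreteness, here is a minimal DBD::SQLite sketch of the per-cell retrieval above: the statements are prepared once and executed per cell. The db filename, the fetch_cell name, and the assumption that met_cell_id is the second column of cell are mine, not taken from carbonmodel.pl.

use strict;
use warnings;
use DBI;

# Hypothetical db filename; the prepared handles are reused for every cell.
my $dbh = DBI->connect("dbi:SQLite:dbname=carbonmodel.db", "", "",
    { RaiseError => 1, AutoCommit => 1 });

my $sth_cell = $dbh->prepare('SELECT * FROM cell WHERE cell_id = ?');
my $sth_lc   = $dbh->prepare(
    'SELECT l.* FROM lc l JOIN cell_lc c ON l.lc_id = c.lc_id WHERE c.cell_id = ?');
my $sth_dist = $dbh->prepare(
    'SELECT d.* FROM dist d JOIN cell_dist c ON d.dist_id = c.dist_id WHERE c.cell_id = ?');
my $sth_met  = $dbh->prepare('SELECT * FROM met WHERE met_cell_id = ?');

sub fetch_cell {
    my ($cell_id) = @_;

    $sth_cell->execute($cell_id);
    my $row = $sth_cell->fetchrow_arrayref or return;
    my @cell = @$row;                 # copy: the arrayref is reused by DBI
    $sth_cell->finish;
    my $met_cell_id = $cell[1];       # assumes met_cell_id is the 2nd column

    $sth_lc->execute($cell_id);
    my $lc = $sth_lc->fetchall_arrayref;

    $sth_dist->execute($cell_id);
    my $dist = $sth_dist->fetchall_arrayref;

    $sth_met->execute($met_cell_id);
    my $met = $sth_met->fetchall_arrayref;

    return [ \@cell, $lc, $dist, $met ];
}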

I did some benchmarking with the above schema using DBD::SQLite, and I get about 30 transactions per second as long as I return the data to memory.

[08:38 AM] ~/Data/carbonmodel$ perl carbonmodel.pl
timethis 1:     0 wallclock secs ( 0.03 usr +  0.00 sys =  0.03 CPU) @ 33.33/s (n=1)
timethis 10:    0 wallclock secs ( 0.31 usr +  0.02 sys =  0.33 CPU) @ 30.30/s (n=10)
timethis 100:   3 wallclock secs ( 2.85 usr +  0.20 sys =  3.05 CPU) @ 32.79/s (n=100)
timethis 1000: 33 wallclock secs (31.08 usr +  1.22 sys = 32.30 CPU) @ 30.96/s (n=1000)
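The output above is from the core Benchmark module's timethis. A sketch of how such a run might be set up, assuming the hypothetical fetch_cell() from the sketch above and randomly chosen cell ids (both assumptions on my part):

use Benchmark qw(timethis);

# Each iteration times one full per-cell retrieval.
for my $n (1, 10, 100, 1000) {
    timethis($n, sub {
        my $cell_id = int(rand(1_000_000)) + 1;   # assumes cell_id runs 1..1,000,000
        my $data = fetch_cell($cell_id);
    });
}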

If I write the data to a file, the speed drops to about 1 transaction per second:

timethis 1000: 783 wallclock secs (732.26 usr + 18.22 sys = 750.48 CPU) @ 1.33/s (n=1000)

Even if I stick with manipulating the data in memory, at 30 transactions per second (or 33 ms per transaction), it would take more than 9 hours to query each of the 1 million cells one by one.

In the real world, I will first find the relevant cell ids based on lat-lon bounds (hence my R*Tree index) and then extract their data one by one.

How can I, if at all, speed this up?

Part 2


Alternatively, I could denormalize the data completely. I was inspired by a post on the Flickr blog (Building Fast Client-side Searches), in particular this paragraph:

"To make this data available quickly from the server, we maintain and update a per-member cache in our database, where we store each member’s contact list in a text blob — this way it’s a single quick DB query to retrieve it. We can format this blob in any way we want: XML, JSON, etc."

I decided to experiment with the same technique. So...

CREATE TABLE cell_blobs (cell_id INTEGER PRIMARY KEY, cell_data BLOB);

I then queried each cell as in Part 1, serialized the result, and stored it in the cell_blobs table. My intent was to simply retrieve a BLOB and deserialize it (I am using the excellent Data::Serializer for the job), on the assumption that this would *possibly* be quicker than 33 ms per retrieval. Well, I haven't yet completed this test because each BLOB is taking about 430 KB (with compress on). At 1 million rows, that is going to occupy upward of 400 GB. I broke the load_blob_table routine after about a third of the records had been processed because I found even loading the BLOBs to be excruciatingly slow.
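For reference, a minimal sketch of the serialize-and-store round trip, using plain Storable (as in the update below) rather than Data::Serializer; the sub names are mine, and $dbh is the handle from the earlier sketch:

use Storable qw(nfreeze thaw);
use DBI qw(:sql_types);

# Store the per-cell result (a ref to an array of arrays) as one BLOB,
# then pull it back with a single SELECT.
my $ins = $dbh->prepare('INSERT INTO cell_blobs (cell_id, cell_data) VALUES (?, ?)');
my $sel = $dbh->prepare('SELECT cell_data FROM cell_blobs WHERE cell_id = ?');

sub store_blob {
    my ($cell_id, $data) = @_;
    $ins->bind_param(1, $cell_id);
    $ins->bind_param(2, nfreeze($data), SQL_BLOB);   # bind as BLOB so the bytes go in raw
    $ins->execute;
}

sub fetch_blob {
    my ($cell_id) = @_;
    $sel->execute($cell_id);
    my ($frozen) = $sel->fetchrow_array or return;
    return thaw($frozen);
}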

Suggestions?

Update posted at 10:41 PM US Central Time on Mar 22, 2009: I ditched Data::Serializer and used bare Storable. The speed doubled! Data::Serializer is superbly easy, but too slow. Now, instead of 30-33 ms per transaction, I am getting 17-18 ms per transaction. Progress.

--

when small people start casting long shadows, it is time to go to bed

Replies are listed 'Best First'.
Re: speeding up row by row lookup in a large db
by perrin (Chancellor) on Mar 21, 2009 at 18:04 UTC
    If you were doing INSERTs or UPDATEs, I'd tell you to use transactions rather than auto-commit, but I only see SELECTs. Another option, since you have tons of RAM, is to load the whole thing into hashes in memory and do all the lookups in Perl. If you don't want to do that, I'd suggest switching to MySQL. SQLite is incredibly convenient, but it's not as fast.
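    A minimal sketch of what "load it into hashes" could look like for the small lookup tables and the two link tables; the variable names and db filename are illustrative, not perrin's:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect("dbi:SQLite:dbname=carbonmodel.db", "", "",
        { RaiseError => 1 });

    # lc and dist are tiny (38 and 10 rows): key each row by its id.
    my $lc   = $dbh->selectall_hashref('SELECT * FROM lc',   'lc_id');
    my $dist = $dbh->selectall_hashref('SELECT * FROM dist', 'dist_id');

    # Link tables: map cell_id => list of lc_ids / dist_ids.
    my (%cell_lc, %cell_dist);
    for my $row (@{ $dbh->selectall_arrayref('SELECT cell_id, lc_id FROM cell_lc') }) {
        push @{ $cell_lc{ $row->[0] } }, $row->[1];
    }
    for my $row (@{ $dbh->selectall_arrayref('SELECT cell_id, dist_id FROM cell_dist') }) {
        push @{ $cell_dist{ $row->[0] } }, $row->[1];
    }

    # The lc/dist part of a cell lookup is now pure hash access:
    my $cell_id = 42;                       # example id
    my @lc_rows = map { $lc->{$_} } @{ $cell_lc{$cell_id} || [] };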
      How do I load the entire data set into memory? The denormalized data set is a few hundred GB (at least, as far as the BLOBs approach suggests), and isn't Perl (or any process on a 32-bit computer) restricted to addressing 2 GB of RAM? I don't really understand that very well, so if you have a strategy I can use, I am all ears.

      Many thanks,

      Update: Adding this note to explain more in response to perrin's note. In my original post above I have shown only the SELECTs and mentioned that each SELECT takes about 33 ms. To try to reduce that, I tried converting the result of each SELECT to a BLOB. Since each result is a ref to an array of arrays, I serialized it using Storable with the help of Data::Serializer and INSERTed it into the blobs table. This was stupendously slow. I tried it both with transactions (experimenting with committing every 10, 100 and 1000 INSERTs) and without. Each BLOB comes to about 430 KB, which would result in a db much larger than my laptop's drive if run to completion; fortunately for my laptop, the darn process ran overnight and had done only about 30,000 or so INSERTs.

      --

      when small people start casting long shadows, it is time to go to bed

        Well, in your question you said the whole database was 430 MB, so you can see why I would suggest loading it into RAM. Perl should be able to access more than 2 GB of RAM on a 64-bit machine, and to some extent on a 32-bit one if you have the right Linux kernel.

        INSERTs will definitely run faster if you only commit every 1000. There may be some other SQLite tuning tricks, which you'd probably find on a mailing list or wiki devoted to SQLite. But if none of those work for you, I think MySQL is your best bet.
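        A minimal sketch of the commit-every-1000 pattern with DBI; $dbh, $ins and @blobs are illustrative names, not from a real script:

        # Explicit transactions: commit every 1000 rows instead of once per INSERT.
        $dbh->{AutoCommit} = 0;
        my $ins = $dbh->prepare(
            'INSERT INTO cell_blobs (cell_id, cell_data) VALUES (?, ?)');

        my $count = 0;
        for my $row (@blobs) {              # [ $cell_id, $serialized_blob ] pairs
            $ins->execute(@$row);
            $dbh->commit if ++$count % 1000 == 0;
        }
        $dbh->commit;                       # flush the final partial batch
        $dbh->{AutoCommit} = 1;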

      Apologies - that's what comes of replying while watching the rugby....
Re: speeding up row by row lookup in a large db
by samtregar (Abbot) on Mar 21, 2009 at 19:57 UTC
    First off, let me echo at maximum volume that you should drop SQLite right away. MySQL is faster, PostgreSQL is faster, everything is faster. SQLite is about convenience, not speed.

    Second, you've got two processors on your current machine and four on your deployment machine. This means you need to think about how to get the most out of all that extra CPU horsepower. Usually this means finding a way to run multiple requests in parallel. My favorite tool for this job is Parallel::ForkManager, although beware: it takes some finesse to get it working right with DBI due to problems with DBI and forking. If you do it right, you might be able to run 2x as fast on your current machine and 4x as fast on your final target, so it's definitely worth the effort.
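    A minimal sketch of that approach, with each child opening its own DBI connection after the fork; the chunking of the cell_id range and the SQLite connection details are illustrative, not from samtregar's post:

    use Parallel::ForkManager;
    use DBI;

    # Split the cell_id range into chunks; each child works through one
    # chunk with its own database handle.
    my $pm     = Parallel::ForkManager->new(4);        # one worker per core
    my @chunks = ([1, 250_000], [250_001, 500_000],
                  [500_001, 750_000], [750_001, 1_000_000]);

    for my $chunk (@chunks) {
        $pm->start and next;                           # parent: launch next chunk

        # Child: open a fresh connection -- never share a DBI handle across a fork.
        my $dbh = DBI->connect("dbi:SQLite:dbname=carbonmodel.db", "", "",
            { RaiseError => 1 });
        my $sth = $dbh->prepare('SELECT * FROM cell WHERE cell_id = ?');
        for my $cell_id ($chunk->[0] .. $chunk->[1]) {
            $sth->execute($cell_id);
            my $row = $sth->fetchrow_arrayref;
            # ... process $row ...
        }
        $dbh->disconnect;
        $pm->finish;
    }
    $pm->wait_all_children;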

    UPDATE: if you do decide to use Parallel::ForkManager, check out this node - Parallel::ForkManager and DBD::mysql, the easy way. I started to put this info here and then decided it would be more useful as a separate node.

    -sam

Re: speeding up row by row lookup in a large db
by Ish (Acolyte) on Mar 21, 2009 at 17:50 UTC
    From the SQLite FAQ, "INSERT is really slow - I can only do few dozen INSERTs per second":

    "Actually, SQLite will easily do 50,000 or more INSERT statements per second on an average desktop computer. But it will only do a few dozen transactions per second. Transaction speed is limited by the rotational speed of your disk drive. A transaction normally requires two complete rotations of the disk platter, which on a 7200 RPM disk drive limits you to about 60 transactions per second.

    Transaction speed is limited by disk drive speed because (by default) SQLite actually waits until the data really is safely stored on the disk surface before the transaction is complete. That way, if you suddenly lose power or if your OS crashes, your data is still safe. For details, read about atomic commit in SQLite.

    By default, each INSERT statement is its own transaction. But if you surround multiple INSERT statements with BEGIN...COMMIT then all the inserts are grouped into a single transaction. The time needed to commit the transaction is amortized over all the enclosed insert statements and so the time per insert statement is greatly reduced.

    Another option is to run PRAGMA synchronous=OFF. This command will cause SQLite to not wait on data to reach the disk surface, which will make write operations appear to be much faster. But if you lose power in the middle of a transaction, your database file might go corrupt."

    Hope that is of use (and relevant!)
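    A sketch of how those two knobs look from DBI; $dbh, $ins and @rows are illustrative, and PRAGMA synchronous=OFF deliberately trades crash safety for speed:

        # Relax fsync behaviour (only for data you can afford to rebuild)
        # and group the INSERTs into a single transaction.
        $dbh->do('PRAGMA synchronous = OFF');

        $dbh->begin_work;
        $ins->execute(@$_) for @rows;   # $ins prepared elsewhere; @rows hold the bind values
        $dbh->commit;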

      If the speed problems are primarily caused by rotational speed and you have some extra money to spend ($150), an (inelegant, non-geeky) way to fix it would be to purchase a fast SLC SSD (solid state drive). I got myself the Mtron Mobi 16 GB. Lacking mechanical components, its random access properties are obviously quite good. Throughput, with read and write speeds of roughly 100 MB/s, is also decent.

      The recommendations of the other monks to use a "normal" DBMS such as MySQL will probably also result in a decent performance gain.
Re: speeding up row by row lookup in a large db
by sundialsvc4 (Abbot) on Mar 22, 2009 at 01:27 UTC

    Let's take a history lesson from ... punched cards, of all things. When you have a large amount of data, sort it. If you have multiple data streams to integrate, sort each one by the same key.

    When you do this, all the records having any given key value will be adjacent, and within any gaps there are known to be no keys at all. No searching is required, and you never need to consider anything more than this record and the preceding one.

    The thing that is eating your lunch here is a familiar one: thousands (or millions) of repetitive searches. “Come with me back to the days when more data than this was processed using much less computing power than you will find today in a dishwasher. It was not a world of hard drives: it was a world of magnetic tapes.” The procedures that were adopted out of necessity back then are just as efficient today.

    The payoff for your efforts can be ... stunning.
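    In Perl terms, the sorted-streams idea boils down to a merge scan; a generic sketch, where read_a(), read_b(), key() and process() are all hypothetical stand-ins:

    # Merge-scan two record streams that are both sorted by the same key:
    # each record is read exactly once and no per-key searching is needed.
    my $left  = read_a();
    my $right = read_b();

    while (defined $left && defined $right) {
        if    (key($left) < key($right)) { $left  = read_a() }   # no match for $left's key
        elsif (key($left) > key($right)) { $right = read_b() }   # no match for $right's key
        else {
            process($left, $right);                              # keys line up
            $right = read_b();                                   # $left may match several records
        }
    }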

      Appreciate the advice, sundialsvc4. The data actually are sorted. I figured out the problem with the help of the most excellent Devel::NYTProf: it was in pulling the met data, which is 7,300 rows by 9 columns for each cell. That one transaction was taking about 28 to 29 ms. I have to figure out how to either compress this data or represent it in some other way so that it gets pulled out quickly.
      --

      when small people start casting long shadows, it is time to go to bed

        Was the cell table composed of 100,000 or 1,000,000 rows? If 100,000, then most assuredly keep it in memory as a hash (i.e., read the whole thing only once). That might not be a bad idea even if it's a million. And obviously do the same with the tiny lc and dist tables.

        Consider storing the data in each table as a blob, but first pack it, and blob the packed entity. IIRC, SQLite keeps data as strings, so it's more voluminous than the actual data. And then you have the cost of all those atoi's to decode it.
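        A sketch of that pack-then-blob idea for the met rows; the met_blobs table, the 9-double row format and the variable names are all assumptions, so adjust the pack template to the real column types:

        use DBI qw(:sql_types);

        # Pack each 9-column met row as fixed-width doubles and store all
        # rows for one met_cell_id as a single BLOB.
        my $ncols  = 9;
        my $format = "d$ncols";
        my $packed = join '', map { pack $format, @$_ } @$met_rows;

        my $ins = $dbh->prepare('INSERT INTO met_blobs (met_cell_id, met_data) VALUES (?, ?)');
        $ins->bind_param(1, $met_cell_id);
        $ins->bind_param(2, $packed, SQL_BLOB);
        $ins->execute;

        # Unpacking on the way out:
        my $rowlen = length pack $format, (0) x $ncols;
        my @rows;
        push @rows, [ unpack $format, substr($packed, $_ * $rowlen, $rowlen) ]
            for 0 .. length($packed) / $rowlen - 1;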

        And if your data can be organized by lat-long, consider a small database for each lat-long region.

        Also, consider the CPAN module Cache::Memcached.

Re: speeding up row by row lookup in a large db
by merlyn (Sage) on Mar 22, 2009 at 16:22 UTC
    If you have to fetch all million cells with individual selects, you're doing it wrong. Either pull out a million-row result in one query, or push more of your data reduction into the database using stored procedures.
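    A sketch of the one-big-query shape of that advice; the cell_id range stands in for whatever lat-lon bound actually selects the cells, and the variable names are illustrative:

    # One streamed query for a whole range of cells instead of a million
    # separate SELECTs; group rows by cell_id as they arrive.
    my $sth = $dbh->prepare(
        'SELECT c.cell_id, c.met_cell_id, m.*
           FROM cell c JOIN met m ON m.met_cell_id = c.met_cell_id
          WHERE c.cell_id BETWEEN ? AND ?
          ORDER BY c.cell_id');
    $sth->execute($lo_cell_id, $hi_cell_id);

    while (my $row = $sth->fetchrow_arrayref) {
        # $row->[0] is cell_id; collect rows until it changes, then process that cell
    }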