speeding up row by row lookup in a large db
by punkish (Priest)
on Mar 21, 2009 at 14:39 UTC
punkish has asked for the wisdom of the Perl Monks concerning the following question:
I have the following schema in a SQLite db that is 430 MB, on my 2.4 GHz Core 2 Duo MacBook laptop's 320 GB, HFS+-formatted, 7200 RPM disk with an 8 MB cache. The laptop has 4 GB RAM. Update: this will eventually reside on a quad-core 3 GHz Xeon Xserve with 32 GB RAM and a 1 TB 7200 RPM disk (I don't know its cache size).
I also have an R*Tree index, but that is a different story, not relevant here.
I retrieve *all* data for one cell ':cell_id' using the following queries
I did some benchmarking with the above schema using DBD::SQLite, and I get about 30 transactions per second as long as I keep the returned data in memory. If I write the data out to a file, the speed drops to about 1 transaction per second.
Even if I stick with manipulating the data in memory, at 30 transactions per second (or 33 ms per transaction), it would take more than 9 hours to query each of the 1 million cells one by one.
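Roughly, the loop being timed looks like the sketch below. The db file name, table name, and column names here are stand-ins for illustration, not my real schema:

#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use Time::HiRes qw(gettimeofday tv_interval);

# stand-in names: 'mydata.sqlite', table 'cell_data', column 'cell_id'
my $dbh = DBI->connect("dbi:SQLite:dbname=mydata.sqlite", "", "",
    { RaiseError => 1, AutoCommit => 1 });

# prepare once, execute once per cell
my $sth = $dbh->prepare("SELECT * FROM cell_data WHERE cell_id = :cell_id");

my @sample = (1 .. 1_000);               # time a sample of cell ids
my $t0 = [gettimeofday];
for my $id (@sample) {
    $sth->bind_param(':cell_id', $id);
    $sth->execute;
    my $rows = $sth->fetchall_arrayref;  # keep the results in memory only
}
printf "%.1f ms per cell\n", 1000 * tv_interval($t0) / @sample;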
In the real world, I will first find the relevant cell ids based on lat-lon bounds (hence my R*Tree index) and then extract their data one by one.
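The R*Tree step would be something along these lines; the virtual table cell_index and its columns are hypothetical (the actual index definition isn't shown here), and $dbh is the handle from the sketch above:

# hypothetical R*Tree table, one bounding box per cell:
# CREATE VIRTUAL TABLE cell_index USING rtree(id, min_lon, max_lon, min_lat, max_lat)
my ($west, $east, $south, $north) = (-98.5, -97.5, 29.5, 30.5);   # example bounds

my $rtree = $dbh->prepare(q{
    SELECT id FROM cell_index
    WHERE  min_lon >= ? AND max_lon <= ?
    AND    min_lat >= ? AND max_lat <= ?
});
$rtree->execute($west, $east, $south, $north);
my @cell_ids = map { $_->[0] } @{ $rtree->fetchall_arrayref };

# then loop over @cell_ids with the per-cell query above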
How can I, if at all, speed this up?
Alternatively, I could denormalize the data completely. I was inspired by a post on the Flickr blog (Building Fast Client-side Searches), in particular this paragraph:
"To make this data available quickly from the server, we maintain and update a per-member cache in our database, where we store each member's contact list in a text blob -- this way it's a single quick DB query to retrieve it. We can format this blob in any way we want: XML, JSON, etc."
I decided to experiment with the same technique. So...
I then queried each cell as in Part 1, serialized the result, and stored it in the cell_blobs table. My intent is to simply retrieve a BLOB and deserialize it (I am using the excellent Data::Serializer for the job), which would *possibly* be quicker than 33 ms per retrieval. Well, I haven't yet completed this test, because each BLOB is coming out at about 430 KB (with compress on). At 1 million rows, that is going to occupy upward of 400 GB. I killed the load_blob_table routine after about a third of the records had been processed, because I found even loading the blobs to be excruciatingly slow.
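The store-and-retrieve shape I am testing is roughly this; the cell_blobs layout (cell_id INTEGER PRIMARY KEY, frozen BLOB) is a stand-in:

use DBI qw(:sql_types);
use Data::Serializer;

my $ser = Data::Serializer->new(serializer => 'Storable', compress => 1);

my $ins = $dbh->prepare("INSERT INTO cell_blobs (cell_id, frozen) VALUES (?, ?)");
my $sel = $dbh->prepare("SELECT frozen FROM cell_blobs WHERE cell_id = ?");

# store: serialize the arrayref of rows returned by the Part 1 queries
sub store_cell {
    my ($cell_id, $rows) = @_;
    $ins->bind_param(1, $cell_id);
    $ins->bind_param(2, $ser->serialize($rows), SQL_BLOB);
    $ins->execute;
}

# retrieve: one row lookup, one deserialize
sub fetch_cell {
    my ($cell_id) = @_;
    $sel->execute($cell_id);
    my ($frozen) = $sel->fetchrow_array;
    return $ser->deserialize($frozen);
}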
Update posted at 10:41 PM US Central Time on Mar 22, 2009: I ditched Data::Serializer and used bare Storable. The speed doubled! Data::Serializer is superbly easy, but too slow. Now, instead of 30-33 ms per transaction, I am getting 17-18 ms per transaction. Progress.
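The Storable version is just a swap of the two serializer calls; nfreeze keeps the byte order portable between the MacBook and the Xserve:

use Storable qw(nfreeze thaw);

# same store/fetch shape as above, calling Storable directly
my $blob = nfreeze($rows);   # instead of $ser->serialize($rows)
my $back = thaw($blob);      # instead of $ser->deserialize($frozen)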
when small people start casting long shadows, it is time to go to bed