punkish has asked for the wisdom of the Perl Monks concerning the following question:
Part 1
I have the following schema in a SQLite db that is 430 MB on my 2.4 GHz Core 2 duo Macbook laptop's 320 GB HFS+ formatted 7200 RPM disk with an 8 MB cache. The laptop has 4 GB RAM. update: This will eventually reside on a quad core Xeon 3 GHz Xserve with 32 GB RAM and a 1 TB 7200 RPM disk (don't know its cache).
-- 1000,000 rows CREATE TABLE cell ( cell_id INTEGER PRIMARY KEY, met_cell_id INTEGER, 8 other INTEGER or REAL columns ) -- 38 rows CREATE TABLE lc ( lc_id INTEGER PRIMARY KEY, 56 other INTEGER or REAL columns ) -- 10 rows CREATE TABLE dist ( dist_id INTEGER PRIMARY KEY, 37 other INTEGER or REAL columns ) -- 2,920,000 rows CREATE TABLE met ( met_id INTEGER PRIMARY KEY, met_cell_id INTEGER, 9 other INTEGER or REAL columns ) CREATE TABLE cell_lc (cell_id INTEGER, lc_id INTEGER) CREATE TABLE cell_dist (cell_id INTEGER, dist_id INTEGER) CREATE INDEX idx_met_cell_id ON met (met_cell_id) CREATE INDEX idx_cell_lc ON cell_lc (cell_id) CREATE INDEX idx_cell_dist ON cell_dist (cell_id)
I also have an R*Tree index, but that is a different story, not relevant here.
I retrieve *all* data for one cell ':cell_id' using the following queries
[1] First retrieve all data from cell table SELECT * FROM cell WHERE cell_id = :cell_id [2] Now retrieve the related lc, dist and met SELECT lc.* FROM lc l JOIN cell_lc c on l.lc_id = c.lc_id WHERE c.cell_id = :cell_id [3] Retrieve the related dist SELECT d.* FROM dist d JOIN cell_lc c on d.dist_id = c.dist_id WHERE c.cell_id = :cell_id [4] Retrieve the related met SELECT * FROM met WHERE met_cell_id = <met_cell_id from query [1] abov +e>
I did some benchmarking with the above schema using DBD::SQLite, and I get about 30 transactions per second as long as I return the data to memory.
[08:38 AM] ~/Data/carbonmodel$perl carbonmodel.pl timethis 1: 0 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU) @ 33 +.33/s (n=1) timethis 10: 0 wallclock secs ( 0.31 usr + 0.02 sys = 0.33 CPU) @ 3 +0.30/s (n=10) timethis 100: 3 wallclock secs ( 2.85 usr + 0.20 sys = 3.05 CPU) @ +32.79/s (n=100) timethis 1000: 33 wallclock secs (31.08 usr + 1.22 sys = 32.30 CPU) @ + 30.96/s (n=1000)
if I write the data to file, the speed drops to about 1 transaction per second
timethis 1000: 783 wallclock secs (732.26 usr + 18.22 sys = 750.48 CPU +) @ 1.33/s (n=1000)
Even if I stick with manipulating the data in memory, at 30 transactions per second (or 33 ms per transaction), it would take more than 9 hours to query each of the 1 million cells one by one.
In the real world, I will first find the relevant cell ids based on lat-lon bounds (hence my R*Tree index) and then extract their data one by one.
How can I, if at all, speed this up?
Part 2
Alternatively, I could denormalize the data completely. Inspired by a post on the Flickr blog (Building Fast Client-side Searches), in particular the para
To make this data available quickly from the server, we maintain and update a per-member cache in our database, where we store each member’s contact list in a text blob — this way it’s a single quick DB query to retrieve it. We can format this blob in any way we want: XML, JSON, etc"
I decided to experiment with the same technique. So...
CREATE TABLE cell_blobs (cell_id INTEGER PRIMARY KEY, cell_data BLOB);
I then queried each cell as in Part 1, serialized it and stored it in the cell_blobs table. My intent is to simply retrieve a BLOB and deserialize it... (I am using the excellent Data::Serializer for the job) it would *possibly* be quicker than 33 ms per retrieval. Well, I haven't yet completed this test because each BLOB is taking about 430 KB (with compress on). At 1 million rows, that is going to occupy upward of 400 GB. I broke the load_blob_table routine after about a third of the records had been processed because I found even the loading_the_blobs to be excruciatingly slow.
Suggestions?
Update posted at 10:41 PM US Central Time on Mar 22, 2009: I ditched Data::Serializer and used bare Storable. The speed doubled! Data::Serializer is superbly easy, but too slow. Now, instead of 30-33 ms per transaction, I am getting 17-18 ms per transaction. Progress.
when small people start casting long shadows, it is time to go to bed
|
---|