Part 1


I have the following schema in a SQLite db that is 430 MB, on my 2.4 GHz Core 2 Duo MacBook laptop with 4 GB RAM and a 320 GB, 7200 RPM, HFS+-formatted disk with an 8 MB cache. Update: this will eventually reside on a quad-core 3 GHz Xeon Xserve with 32 GB RAM and a 1 TB 7200 RPM disk (I don't know its cache size).

-- 1,000,000 rows
CREATE TABLE cell (
    cell_id INTEGER PRIMARY KEY,
    met_cell_id INTEGER,
    -- 8 other INTEGER or REAL columns
)

-- 38 rows
CREATE TABLE lc (
    lc_id INTEGER PRIMARY KEY,
    -- 56 other INTEGER or REAL columns
)

-- 10 rows
CREATE TABLE dist (
    dist_id INTEGER PRIMARY KEY,
    -- 37 other INTEGER or REAL columns
)

-- 2,920,000 rows
CREATE TABLE met (
    met_id INTEGER PRIMARY KEY,
    met_cell_id INTEGER,
    -- 9 other INTEGER or REAL columns
)

CREATE TABLE cell_lc (cell_id INTEGER, lc_id INTEGER)
CREATE TABLE cell_dist (cell_id INTEGER, dist_id INTEGER)

CREATE INDEX idx_met_cell_id ON met (met_cell_id)
CREATE INDEX idx_cell_lc ON cell_lc (cell_id)
CREATE INDEX idx_cell_dist ON cell_dist (cell_id)

I also have an R*Tree index, but that is a different story, not relevant here.

I retrieve *all* data for one cell ':cell_id' using the following queries:

-- [1] First retrieve all data from the cell table
SELECT * FROM cell WHERE cell_id = :cell_id

-- [2] Retrieve the related lc
SELECT l.* FROM lc l JOIN cell_lc c ON l.lc_id = c.lc_id
WHERE c.cell_id = :cell_id

-- [3] Retrieve the related dist
SELECT d.* FROM dist d JOIN cell_dist c ON d.dist_id = c.dist_id
WHERE c.cell_id = :cell_id

-- [4] Retrieve the related met
SELECT * FROM met WHERE met_cell_id = <met_cell_id from query [1] above>
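In Perl, the retrieval looks roughly like this (a minimal sketch; the db file name and the fetch_cell sub name are just my placeholders, and all four statements are prepared once up front):

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=carbonmodel.db', '', '',
    { RaiseError => 1 });

my $sth_cell = $dbh->prepare('SELECT * FROM cell WHERE cell_id = ?');
my $sth_lc   = $dbh->prepare(
    'SELECT l.* FROM lc l JOIN cell_lc c ON l.lc_id = c.lc_id
     WHERE c.cell_id = ?');
my $sth_dist = $dbh->prepare(
    'SELECT d.* FROM dist d JOIN cell_dist c ON d.dist_id = c.dist_id
     WHERE c.cell_id = ?');
my $sth_met  = $dbh->prepare('SELECT * FROM met WHERE met_cell_id = ?');

sub fetch_cell {
    my ($cell_id) = @_;

    # [1] the cell row itself (cell_id is the primary key, so one row)
    $sth_cell->execute($cell_id);
    my $cell = $sth_cell->fetchrow_hashref;
    $sth_cell->finish;

    # [2] and [3] the related lc and dist rows via the link tables
    $sth_lc->execute($cell_id);
    my $lc = $sth_lc->fetchall_arrayref({});

    $sth_dist->execute($cell_id);
    my $dist = $sth_dist->fetchall_arrayref({});

    # [4] the related met rows, keyed off met_cell_id from query [1]
    $sth_met->execute($cell->{met_cell_id});
    my $met = $sth_met->fetchall_arrayref({});

    return { cell => $cell, lc => $lc, dist => $dist, met => $met };
}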

I did some benchmarking with the above schema using DBD::SQLite, and I get about 30 transactions per second as long as I keep the returned data in memory.
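The harness is essentially Benchmark's timethis wrapped around one full fetch (a sketch; the random cell id is only there to spread the lookups around):

use Benchmark qw(timethis);

# one "transaction" = one complete fetch_cell() round trip
timethis(1000, sub { my $data = fetch_cell(int(rand 1_000_000) + 1) });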

[08:38 AM] ~/Data/carbonmodel$ perl carbonmodel.pl
timethis 1:     0 wallclock secs ( 0.03 usr +  0.00 sys =  0.03 CPU) @ 33.33/s (n=1)
timethis 10:    0 wallclock secs ( 0.31 usr +  0.02 sys =  0.33 CPU) @ 30.30/s (n=10)
timethis 100:   3 wallclock secs ( 2.85 usr +  0.20 sys =  3.05 CPU) @ 32.79/s (n=100)
timethis 1000: 33 wallclock secs (31.08 usr +  1.22 sys = 32.30 CPU) @ 30.96/s (n=1000)

If I write the data to a file, the speed drops to about 1 transaction per second:

timethis 1000: 783 wallclock secs (732.26 usr + 18.22 sys = 750.48 CPU) @ 1.33/s (n=1000)

Even if I stick with manipulating the data in memory, at 30 transactions per second (about 33 ms per transaction), querying each of the 1 million cells one by one would take more than 9 hours (1,000,000 × 33 ms ≈ 33,000 s ≈ 9.2 hours).

In the real world, I will first find the relevant cell ids based on lat-lon bounds (hence my R*Tree index) and then extract their data one by one.
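For what it's worth, that bounding-box step would look something like this (a sketch only; I haven't shown my actual R*Tree index here, so the table and column names below are illustrative, not my real schema):

-- an R*Tree virtual table holding each cell's bounding box
CREATE VIRTUAL TABLE cell_rtree USING rtree(
    cell_id, min_lon, max_lon, min_lat, max_lat
);

-- find the cell ids inside a lat-lon bounding box
SELECT cell_id FROM cell_rtree
WHERE min_lon >= :west  AND max_lon <= :east
  AND min_lat >= :south AND max_lat <= :north;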

How can I, if at all, speed this up?

Part 2


Alternatively, I could denormalize the data completely. I was inspired by a post on the Flickr blog (Building Fast Client-side Searches), in particular this paragraph:

"To make this data available quickly from the server, we maintain and update a per-member cache in our database, where we store each member’s contact list in a text blob — this way it’s a single quick DB query to retrieve it. We can format this blob in any way we want: XML, JSON, etc."

I decided to experiment with the same technique. So...

CREATE TABLE cell_blobs (cell_id INTEGER PRIMARY KEY, cell_data BLOB);

I then queried each cell as in Part 1, serialized the result, and stored it in the cell_blobs table. My intent was to simply retrieve a BLOB and deserialize it (I am using the excellent Data::Serializer for the job); that would *possibly* be quicker than 33 ms per retrieval. Well, I haven't yet completed this test, because each BLOB is taking about 430 KB (with compress on). At 1 million rows, that is going to occupy upward of 400 GB. I broke off the load_blob_table routine after about a third of the records had been processed, because I found even loading the blobs excruciatingly slow.
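The loader is essentially this (a sketch of load_blob_table; it reuses $dbh and the hypothetical fetch_cell sub from the Part 1 sketch):

use Data::Serializer;

my $ser = Data::Serializer->new(
    serializer => 'Storable',
    compress   => 1,    # the "compress on" mentioned above
);

my $ins = $dbh->prepare(
    'INSERT INTO cell_blobs (cell_id, cell_data) VALUES (?, ?)');

sub load_blob_table {
    for my $cell_id (1 .. 1_000_000) {
        my $data = fetch_cell($cell_id);    # the queries from Part 1
        $ins->execute($cell_id, $ser->serialize($data));
    }
}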

Suggestions?

Update posted at 10:41 PM US Central Time on Mar 22, 2009: I ditched Data::Serializer and used bare Storable. The speed doubled! Data::Serializer is superbly easy to use, but too slow. Now, instead of 30-33 ms per transaction, I am getting 17-18 ms per transaction. Progress.
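The bare-Storable version, for comparison (a sketch; nfreeze/thaw stand in for Data::Serializer's serialize/deserialize, the insert handle $ins comes from the sketch above, and the value is bound with an explicit SQL_BLOB type so it round-trips as raw bytes):

use Storable qw(nfreeze thaw);
use DBI qw(:sql_types);

my $get = $dbh->prepare(
    'SELECT cell_data FROM cell_blobs WHERE cell_id = ?');

sub store_blob {
    my ($cell_id, $data) = @_;
    $ins->bind_param(1, $cell_id);
    $ins->bind_param(2, nfreeze($data), SQL_BLOB);  # portable byte string
    $ins->execute;
}

sub fetch_blob {
    my ($cell_id) = @_;
    my ($frozen) = $dbh->selectrow_array($get, undef, $cell_id);
    return thaw($frozen);
}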

--

when small people start casting long shadows, it is time to go to bed

In reply to speeding up row by row lookup in a large db by punkish
