Part 1


I have the following schema in a SQLite db that is 430 MB, on my 2.4 GHz Core 2 Duo MacBook laptop with 4 GB RAM and a 320 GB, 7200 RPM, HFS+-formatted disk with an 8 MB cache. Update: this will eventually reside on a quad-core 3 GHz Xeon Xserve with 32 GB RAM and a 1 TB 7200 RPM disk (I don't know its cache size).

-- 1,000,000 rows
CREATE TABLE cell (
    cell_id INTEGER PRIMARY KEY,
    met_cell_id INTEGER,
    -- 8 other INTEGER or REAL columns
)

-- 38 rows
CREATE TABLE lc (
    lc_id INTEGER PRIMARY KEY,
    -- 56 other INTEGER or REAL columns
)

-- 10 rows
CREATE TABLE dist (
    dist_id INTEGER PRIMARY KEY,
    -- 37 other INTEGER or REAL columns
)

-- 2,920,000 rows
CREATE TABLE met (
    met_id INTEGER PRIMARY KEY,
    met_cell_id INTEGER,
    -- 9 other INTEGER or REAL columns
)

CREATE TABLE cell_lc (cell_id INTEGER, lc_id INTEGER)
CREATE TABLE cell_dist (cell_id INTEGER, dist_id INTEGER)

CREATE INDEX idx_met_cell_id ON met (met_cell_id)
CREATE INDEX idx_cell_lc ON cell_lc (cell_id)
CREATE INDEX idx_cell_dist ON cell_dist (cell_id)

I also have an R*Tree index, but that is a different story, not relevant here.

I retrieve *all* data for one cell ':cell_id' using the following queries:

-- [1] First retrieve all data from the cell table
SELECT * FROM cell WHERE cell_id = :cell_id

-- [2] Retrieve the related lc
SELECT l.* FROM lc l JOIN cell_lc c ON l.lc_id = c.lc_id
WHERE c.cell_id = :cell_id

-- [3] Retrieve the related dist
SELECT d.* FROM dist d JOIN cell_dist c ON d.dist_id = c.dist_id
WHERE c.cell_id = :cell_id

-- [4] Retrieve the related met
SELECT * FROM met WHERE met_cell_id = <met_cell_id from query [1] above>
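In Perl, the retrieval looks roughly like this (a minimal sketch; the db file name and the fetch_cell sub name are just my placeholders, and all four statements are prepared once up front):

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=carbonmodel.db', '', '',
    { RaiseError => 1 });

my $sth_cell = $dbh->prepare('SELECT * FROM cell WHERE cell_id = ?');
my $sth_lc   = $dbh->prepare(
    'SELECT l.* FROM lc l JOIN cell_lc c ON l.lc_id = c.lc_id
     WHERE c.cell_id = ?');
my $sth_dist = $dbh->prepare(
    'SELECT d.* FROM dist d JOIN cell_dist c ON d.dist_id = c.dist_id
     WHERE c.cell_id = ?');
my $sth_met  = $dbh->prepare('SELECT * FROM met WHERE met_cell_id = ?');

sub fetch_cell {
    my ($cell_id) = @_;

    # [1] the cell row itself (cell_id is the primary key, so one row)
    $sth_cell->execute($cell_id);
    my $cell = $sth_cell->fetchrow_hashref;
    $sth_cell->finish;

    # [2] and [3] the related lc and dist rows via the link tables
    $sth_lc->execute($cell_id);
    my $lc = $sth_lc->fetchall_arrayref({});

    $sth_dist->execute($cell_id);
    my $dist = $sth_dist->fetchall_arrayref({});

    # [4] the related met rows, keyed off met_cell_id from query [1]
    $sth_met->execute($cell->{met_cell_id});
    my $met = $sth_met->fetchall_arrayref({});

    return { cell => $cell, lc => $lc, dist => $dist, met => $met };
}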

I did some benchmarking with the above schema using DBD::SQLite, and I get about 30 transactions per second as long as I keep the returned data in memory.
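The harness is essentially Benchmark's timethis wrapped around one full fetch (a sketch; the random cell id is only there to spread the lookups around):

use Benchmark qw(timethis);

# one "transaction" = one complete fetch_cell() round trip
timethis(1000, sub { my $data = fetch_cell(int(rand 1_000_000) + 1) });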

[08:38 AM] ~/Data/carbonmodel$ perl carbonmodel.pl
timethis 1:     0 wallclock secs ( 0.03 usr +  0.00 sys =  0.03 CPU) @ 33.33/s (n=1)
timethis 10:    0 wallclock secs ( 0.31 usr +  0.02 sys =  0.33 CPU) @ 30.30/s (n=10)
timethis 100:   3 wallclock secs ( 2.85 usr +  0.20 sys =  3.05 CPU) @ 32.79/s (n=100)
timethis 1000: 33 wallclock secs (31.08 usr +  1.22 sys = 32.30 CPU) @ 30.96/s (n=1000)

If I write the data to a file, the speed drops to about 1 transaction per second:

timethis 1000: 783 wallclock secs (732.26 usr + 18.22 sys = 750.48 CPU) @ 1.33/s (n=1000)

Even if I stick with manipulating the data in memory, at 30 transactions per second (about 33 ms per transaction), querying each of the 1 million cells one by one would take more than 9 hours (1,000,000 × 33 ms ≈ 33,000 s ≈ 9.2 hours).

In the real world, I will first find the relevant cell ids based on lat-lon bounds (hence my R*Tree index) and then extract their data one by one.
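For what it's worth, that bounding-box step would look something like this (a sketch only; I haven't shown my actual R*Tree index here, so the table and column names below are illustrative, not my real schema):

-- an R*Tree virtual table holding each cell's bounding box
CREATE VIRTUAL TABLE cell_rtree USING rtree(
    cell_id, min_lon, max_lon, min_lat, max_lat
);

-- find the cell ids inside a lat-lon bounding box
SELECT cell_id FROM cell_rtree
WHERE min_lon >= :west  AND max_lon <= :east
  AND min_lat >= :south AND max_lat <= :north;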

How can I, if at all, speed this up?

Part 2


Alternatively, I could denormalize the data completely. I was inspired by a post on the Flickr blog (Building Fast Client-side Searches), in particular this paragraph:

"To make this data available quickly from the server, we maintain and update a per-member cache in our database, where we store each member’s contact list in a text blob — this way it’s a single quick DB query to retrieve it. We can format this blob in any way we want: XML, JSON, etc."

I decided to experiment with the same technique. So...

CREATE TABLE cell_blobs (cell_id INTEGER PRIMARY KEY, cell_data BLOB);

I then queried each cell as in Part 1, serialized the result, and stored it in the cell_blobs table. My intent was to simply retrieve a BLOB and deserialize it (I am using the excellent Data::Serializer for the job); that would *possibly* be quicker than 33 ms per retrieval. Well, I haven't yet completed this test, because each BLOB is taking about 430 KB (with compress on). At 1 million rows, that is going to occupy upward of 400 GB. I broke off the load_blob_table routine after about a third of the records had been processed, because I found even loading the blobs excruciatingly slow.
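The loader is essentially this (a sketch of load_blob_table; it reuses $dbh and the hypothetical fetch_cell sub from the Part 1 sketch):

use Data::Serializer;

my $ser = Data::Serializer->new(
    serializer => 'Storable',
    compress   => 1,    # the "compress on" mentioned above
);

my $ins = $dbh->prepare(
    'INSERT INTO cell_blobs (cell_id, cell_data) VALUES (?, ?)');

sub load_blob_table {
    for my $cell_id (1 .. 1_000_000) {
        my $data = fetch_cell($cell_id);    # the queries from Part 1
        $ins->execute($cell_id, $ser->serialize($data));
    }
}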

Suggestions?

Update posted at 10:41 PM US Central Time on Mar 22, 2009: I ditched Data::Serializer and used bare Storable. The speed doubled! Data::Serializer is superbly easy to use, but too slow. Now, instead of 30-33 ms per transaction, I am getting 17-18 ms per transaction. Progress.
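The bare-Storable version, for comparison (a sketch; nfreeze/thaw stand in for Data::Serializer's serialize/deserialize, the insert handle $ins comes from the sketch above, and the value is bound with an explicit SQL_BLOB type so it round-trips as raw bytes):

use Storable qw(nfreeze thaw);
use DBI qw(:sql_types);

my $get = $dbh->prepare(
    'SELECT cell_data FROM cell_blobs WHERE cell_id = ?');

sub store_blob {
    my ($cell_id, $data) = @_;
    $ins->bind_param(1, $cell_id);
    $ins->bind_param(2, nfreeze($data), SQL_BLOB);  # portable byte string
    $ins->execute;
}

sub fetch_blob {
    my ($cell_id) = @_;
    my ($frozen) = $dbh->selectrow_array($get, undef, $cell_id);
    return thaw($frozen);
}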

--

when small people start casting long shadows, it is time to go to bed

In reply to speeding up row by row lookup in a large db by punkish
