Part 1


I have the following schema in a SQLite db that is 430 MB, on my 2.4 GHz Core 2 Duo MacBook's 320 GB, HFS+ formatted, 7200 RPM disk with an 8 MB cache. The laptop has 4 GB RAM. Update: this will eventually reside on a quad-core 3 GHz Xeon Xserve with 32 GB RAM and a 1 TB 7200 RPM disk (I don't know its cache size).

-- 1,000,000 rows
CREATE TABLE cell (
    cell_id INTEGER PRIMARY KEY,
    met_cell_id INTEGER,
    8 other INTEGER or REAL columns
)

-- 38 rows
CREATE TABLE lc (
    lc_id INTEGER PRIMARY KEY,
    56 other INTEGER or REAL columns
)

-- 10 rows
CREATE TABLE dist (
    dist_id INTEGER PRIMARY KEY,
    37 other INTEGER or REAL columns
)

-- 2,920,000 rows
CREATE TABLE met (
    met_id INTEGER PRIMARY KEY,
    met_cell_id INTEGER,
    9 other INTEGER or REAL columns
)

CREATE TABLE cell_lc (cell_id INTEGER, lc_id INTEGER)
CREATE TABLE cell_dist (cell_id INTEGER, dist_id INTEGER)

CREATE INDEX idx_met_cell_id ON met (met_cell_id)
CREATE INDEX idx_cell_lc ON cell_lc (cell_id)
CREATE INDEX idx_cell_dist ON cell_dist (cell_id)

I also have an R*Tree index, but that is a different story, not relevant here.

I retrieve *all* the data for one cell, :cell_id, using the following queries:

-- [1] First retrieve all data from the cell table
SELECT * FROM cell WHERE cell_id = :cell_id

-- Now retrieve the related lc, dist and met

-- [2] Retrieve the related lc
SELECT l.*
FROM lc l
JOIN cell_lc c ON l.lc_id = c.lc_id
WHERE c.cell_id = :cell_id

-- [3] Retrieve the related dist
SELECT d.*
FROM dist d
JOIN cell_dist c ON d.dist_id = c.dist_id
WHERE c.cell_id = :cell_id

-- [4] Retrieve the related met
SELECT * FROM met
WHERE met_cell_id = <met_cell_id from query [1] above>
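For reference, the Perl driving these queries looks roughly like the sketch below. The get_cell wrapper and the database file name are illustrative, not from my actual script; prepare_cached compiles each statement once and reuses the handle across all the per-cell lookups.

use DBI;

my $dbh = DBI->connect("dbi:SQLite:dbname=carbonmodel.db", "", "",
                       { RaiseError => 1 });

sub get_cell {
    my ($cell_id) = @_;

    # [1] the cell row itself
    my $sth_cell = $dbh->prepare_cached(
        "SELECT * FROM cell WHERE cell_id = ?");
    $sth_cell->execute($cell_id);
    my $cell = $sth_cell->fetchrow_hashref;
    $sth_cell->finish;

    # [2] related lc rows via the cell_lc join table
    my $sth_lc = $dbh->prepare_cached(
        "SELECT l.* FROM lc l JOIN cell_lc c ON l.lc_id = c.lc_id
         WHERE c.cell_id = ?");
    $sth_lc->execute($cell_id);
    my $lc = $sth_lc->fetchall_arrayref({});

    # [3] related dist rows via the cell_dist join table
    my $sth_dist = $dbh->prepare_cached(
        "SELECT d.* FROM dist d JOIN cell_dist c ON d.dist_id = c.dist_id
         WHERE c.cell_id = ?");
    $sth_dist->execute($cell_id);
    my $dist = $sth_dist->fetchall_arrayref({});

    # [4] related met rows, keyed on met_cell_id from query [1]
    my $sth_met = $dbh->prepare_cached(
        "SELECT * FROM met WHERE met_cell_id = ?");
    $sth_met->execute($cell->{met_cell_id});
    my $met = $sth_met->fetchall_arrayref({});

    return { cell => $cell, lc => $lc, dist => $dist, met => $met };
}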

I did some benchmarking with the above schema using DBD::SQLite, and I get about 30 transactions per second as long as I return the data to memory.

[08:38 AM] ~/Data/carbonmodel$ perl carbonmodel.pl
timethis 1:     0 wallclock secs ( 0.03 usr +  0.00 sys =  0.03 CPU) @ 33.33/s (n=1)
timethis 10:    0 wallclock secs ( 0.31 usr +  0.02 sys =  0.33 CPU) @ 30.30/s (n=10)
timethis 100:   3 wallclock secs ( 2.85 usr +  0.20 sys =  3.05 CPU) @ 32.79/s (n=100)
timethis 1000: 33 wallclock secs (31.08 usr +  1.22 sys = 32.30 CPU) @ 30.96/s (n=1000)
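The harness is just Benchmark's timethis wrapped around the lookup; a stripped-down sketch of it, with get_cell as sketched above and the choice of random cell ids being illustrative:

use Benchmark qw(timethis);

# Time n repeated single-cell lookups for increasing n
for my $n (1, 10, 100, 1000) {
    timethis($n, sub { get_cell( int(rand(1_000_000)) + 1 ) });
}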

If I write the data to file, the speed drops to about 1 transaction per second:

timethis 1000: 783 wallclock secs (732.26 usr + 18.22 sys = 750.48 CPU) @ 1.33/s (n=1000)

Even if I stick with manipulating the data in memory, at 30 transactions per second (or 33 ms per transaction), it would take more than 9 hours to query each of the 1 million cells one by one.

In the real world, I will first find the relevant cell ids based on lat-lon bounds (hence my R*Tree index) and then extract their data one by one.
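Just to make that plan concrete, the bounding-box step might look like the sketch below, continuing from the earlier get_cell sketch. The cell_index virtual table and its column names are hypothetical, since my actual R*Tree definition isn't shown here.

# Hypothetical R*Tree lookup: find cell ids whose bounding boxes
# overlap a lat-lon query window, then fetch each cell's data.
# cell_index and its columns are illustrative names only.
my $ids = $dbh->selectcol_arrayref(
    "SELECT cell_id FROM cell_index
     WHERE  min_lon <= ? AND max_lon >= ?
     AND    min_lat <= ? AND max_lat >= ?",
    undef, $east, $west, $north, $south);

get_cell($_) for @$ids;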

How can I, if at all, speed this up?

Part 2


Alternatively, I could denormalize the data completely. I was inspired by a post on the Flickr blog ("Building Fast Client-side Searches"), in particular the following paragraph:

"To make this data available quickly from the server, we maintain and update a per-member cache in our database, where we store each member’s contact list in a text blob — this way it’s a single quick DB query to retrieve it. We can format this blob in any way we want: XML, JSON, etc."

I decided to experiment with the same technique. So...

CREATE TABLE cell_blobs (cell_id INTEGER PRIMARY KEY, cell_data BLOB);

I then queried each cell as in Part 1, serialized the result, and stored it in the cell_blobs table. My intent is to simply retrieve a BLOB and deserialize it (I am using the excellent Data::Serializer for the job); that would *possibly* be quicker than 33 ms per retrieval. Well, I haven't yet completed this test, because each BLOB is taking about 430 KB (with compress on). At 1 million rows, that is going to occupy upward of 400 GB. I aborted the load_blob_table routine after about a third of the records had been processed, because I found even loading the blobs to be excruciatingly slow.
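The serialization step uses Data::Serializer's serialize/deserialize methods; a sketch, with compress turned on as mentioned (the choice of Storable as the underlying serializer here is illustrative):

use Data::Serializer;

# Serializer with compression, as in the blob experiment
my $ser = Data::Serializer->new(
    serializer => 'Storable',
    compress   => 1,
);

my $blob = $ser->serialize($cell_data);   # to store in cell_blobs
my $back = $ser->deserialize($blob);      # on retrieval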

Suggestions?

Update posted at 10:41 PM US Central Time on Mar 22, 2009: I ditched Data::Serializer and used bare Storable. The speed doubled! Data::Serializer is superbly easy, but too slow. Now, instead of 30-33 ms per transaction, I am getting 17-18 ms per transaction. Progress.
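For the record, the bare Storable round trip is essentially freeze on the way in and thaw on the way out; a simplified sketch, where the store_blob and fetch_blob helper names are mine and $dbh is as before:

use DBI qw(:sql_types);
use Storable qw(freeze thaw);

# Serialize one cell's data structure and store it as a BLOB
sub store_blob {
    my ($cell_id, $data) = @_;
    my $sth = $dbh->prepare_cached(
        "INSERT OR REPLACE INTO cell_blobs (cell_id, cell_data)
         VALUES (?, ?)");
    $sth->bind_param(1, $cell_id);
    $sth->bind_param(2, freeze($data), SQL_BLOB);
    $sth->execute;
}

# Retrieve and deserialize in a single indexed lookup
sub fetch_blob {
    my ($cell_id) = @_;
    my ($blob) = $dbh->selectrow_array(
        "SELECT cell_data FROM cell_blobs WHERE cell_id = ?",
        undef, $cell_id);
    return thaw($blob);
}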

--

when small people start casting long shadows, it is time to go to bed

In reply to speeding up row by row lookup in a large db by punkish
