Part 1


I have the following schema in a SQLite db that is 430 MB, on my 2.4 GHz Core 2 Duo MacBook's 320 GB, HFS+ formatted, 7200 RPM disk with an 8 MB cache. The laptop has 4 GB RAM. Update: this will eventually reside on a quad-core 3 GHz Xeon Xserve with 32 GB RAM and a 1 TB 7200 RPM disk (I don't know its cache size).

-- 1,000,000 rows
CREATE TABLE cell (
    cell_id INTEGER PRIMARY KEY,
    met_cell_id INTEGER,
    8 other INTEGER or REAL columns
)

-- 38 rows
CREATE TABLE lc (
    lc_id INTEGER PRIMARY KEY,
    56 other INTEGER or REAL columns
)

-- 10 rows
CREATE TABLE dist (
    dist_id INTEGER PRIMARY KEY,
    37 other INTEGER or REAL columns
)

-- 2,920,000 rows
CREATE TABLE met (
    met_id INTEGER PRIMARY KEY,
    met_cell_id INTEGER,
    9 other INTEGER or REAL columns
)

CREATE TABLE cell_lc (cell_id INTEGER, lc_id INTEGER)
CREATE TABLE cell_dist (cell_id INTEGER, dist_id INTEGER)

CREATE INDEX idx_met_cell_id ON met (met_cell_id)
CREATE INDEX idx_cell_lc ON cell_lc (cell_id)
CREATE INDEX idx_cell_dist ON cell_dist (cell_id)

I also have an R*Tree index, but that is a different story, not relevant here.

I retrieve *all* the data for one cell, :cell_id, using the following queries:

-- [1] First retrieve all data from the cell table
SELECT * FROM cell WHERE cell_id = :cell_id

-- Now retrieve the related lc, dist and met

-- [2] Retrieve the related lc
SELECT l.*
FROM lc l
JOIN cell_lc c ON l.lc_id = c.lc_id
WHERE c.cell_id = :cell_id

-- [3] Retrieve the related dist
SELECT d.*
FROM dist d
JOIN cell_dist c ON d.dist_id = c.dist_id
WHERE c.cell_id = :cell_id

-- [4] Retrieve the related met
SELECT * FROM met
WHERE met_cell_id = <met_cell_id from query [1] above>
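For reference, the Perl driving these queries looks roughly like the sketch below. The get_cell wrapper and the database file name are illustrative, not from my actual script; prepare_cached compiles each statement once and reuses the handle across all the per-cell lookups.

use DBI;

my $dbh = DBI->connect("dbi:SQLite:dbname=carbonmodel.db", "", "",
                       { RaiseError => 1 });

sub get_cell {
    my ($cell_id) = @_;

    # [1] the cell row itself
    my $sth_cell = $dbh->prepare_cached(
        "SELECT * FROM cell WHERE cell_id = ?");
    $sth_cell->execute($cell_id);
    my $cell = $sth_cell->fetchrow_hashref;
    $sth_cell->finish;

    # [2] related lc rows via the cell_lc join table
    my $sth_lc = $dbh->prepare_cached(
        "SELECT l.* FROM lc l JOIN cell_lc c ON l.lc_id = c.lc_id
         WHERE c.cell_id = ?");
    $sth_lc->execute($cell_id);
    my $lc = $sth_lc->fetchall_arrayref({});

    # [3] related dist rows via the cell_dist join table
    my $sth_dist = $dbh->prepare_cached(
        "SELECT d.* FROM dist d JOIN cell_dist c ON d.dist_id = c.dist_id
         WHERE c.cell_id = ?");
    $sth_dist->execute($cell_id);
    my $dist = $sth_dist->fetchall_arrayref({});

    # [4] related met rows, keyed on met_cell_id from query [1]
    my $sth_met = $dbh->prepare_cached(
        "SELECT * FROM met WHERE met_cell_id = ?");
    $sth_met->execute($cell->{met_cell_id});
    my $met = $sth_met->fetchall_arrayref({});

    return { cell => $cell, lc => $lc, dist => $dist, met => $met };
}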

I did some benchmarking with the above schema using DBD::SQLite, and I get about 30 transactions per second as long as I return the data to memory.

[08:38 AM] ~/Data/carbonmodel$ perl carbonmodel.pl
timethis 1:     0 wallclock secs ( 0.03 usr +  0.00 sys =  0.03 CPU) @ 33.33/s (n=1)
timethis 10:    0 wallclock secs ( 0.31 usr +  0.02 sys =  0.33 CPU) @ 30.30/s (n=10)
timethis 100:   3 wallclock secs ( 2.85 usr +  0.20 sys =  3.05 CPU) @ 32.79/s (n=100)
timethis 1000: 33 wallclock secs (31.08 usr +  1.22 sys = 32.30 CPU) @ 30.96/s (n=1000)
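The harness is just Benchmark's timethis wrapped around the lookup; a stripped-down sketch of it, with get_cell as sketched above and the choice of random cell ids being illustrative:

use Benchmark qw(timethis);

# Time n repeated single-cell lookups for increasing n
for my $n (1, 10, 100, 1000) {
    timethis($n, sub { get_cell( int(rand(1_000_000)) + 1 ) });
}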

If I write the data to file, the speed drops to about 1 transaction per second:

timethis 1000: 783 wallclock secs (732.26 usr + 18.22 sys = 750.48 CPU) @ 1.33/s (n=1000)

Even if I stick with manipulating the data in memory, at 30 transactions per second (or 33 ms per transaction), it would take more than 9 hours to query each of the 1 million cells one by one.

In the real world, I will first find the relevant cell ids based on lat-lon bounds (hence my R*Tree index) and then extract their data one by one.
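Just to make that plan concrete, the bounding-box step might look like the sketch below, continuing from the earlier get_cell sketch. The cell_index virtual table and its column names are hypothetical, since my actual R*Tree definition isn't shown here.

# Hypothetical R*Tree lookup: find cell ids whose bounding boxes
# overlap a lat-lon query window, then fetch each cell's data.
# cell_index and its columns are illustrative names only.
my $ids = $dbh->selectcol_arrayref(
    "SELECT cell_id FROM cell_index
     WHERE  min_lon <= ? AND max_lon >= ?
     AND    min_lat <= ? AND max_lat >= ?",
    undef, $east, $west, $north, $south);

get_cell($_) for @$ids;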

How can I, if at all, speed this up?

Part 2


Alternatively, I could denormalize the data completely. I was inspired by a post on the Flickr blog ("Building Fast Client-side Searches"), in particular the following paragraph:

"To make this data available quickly from the server, we maintain and update a per-member cache in our database, where we store each member’s contact list in a text blob — this way it’s a single quick DB query to retrieve it. We can format this blob in any way we want: XML, JSON, etc."

I decided to experiment with the same technique. So...

CREATE TABLE cell_blobs (cell_id INTEGER PRIMARY KEY, cell_data BLOB);

I then queried each cell as in Part 1, serialized the result, and stored it in the cell_blobs table. My intent is to simply retrieve a BLOB and deserialize it (I am using the excellent Data::Serializer for the job); that would *possibly* be quicker than 33 ms per retrieval. Well, I haven't yet completed this test, because each BLOB is taking about 430 KB (with compress on). At 1 million rows, that is going to occupy upward of 400 GB. I aborted the load_blob_table routine after about a third of the records had been processed, because I found even loading the blobs to be excruciatingly slow.
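The serialization step uses Data::Serializer's serialize/deserialize methods; a sketch, with compress turned on as mentioned (the choice of Storable as the underlying serializer here is illustrative):

use Data::Serializer;

# Serializer with compression, as in the blob experiment
my $ser = Data::Serializer->new(
    serializer => 'Storable',
    compress   => 1,
);

my $blob = $ser->serialize($cell_data);   # to store in cell_blobs
my $back = $ser->deserialize($blob);      # on retrieval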

Suggestions?

Update posted at 10:41 PM US Central Time on Mar 22, 2009: I ditched Data::Serializer and used bare Storable. The speed doubled! Data::Serializer is superbly easy, but too slow. Now, instead of 30-33 ms per transaction, I am getting 17-18 ms per transaction. Progress.
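For the record, the bare Storable round trip is essentially freeze on the way in and thaw on the way out; a simplified sketch, where the store_blob and fetch_blob helper names are mine and $dbh is as before:

use DBI qw(:sql_types);
use Storable qw(freeze thaw);

# Serialize one cell's data structure and store it as a BLOB
sub store_blob {
    my ($cell_id, $data) = @_;
    my $sth = $dbh->prepare_cached(
        "INSERT OR REPLACE INTO cell_blobs (cell_id, cell_data)
         VALUES (?, ?)");
    $sth->bind_param(1, $cell_id);
    $sth->bind_param(2, freeze($data), SQL_BLOB);
    $sth->execute;
}

# Retrieve and deserialize in a single indexed lookup
sub fetch_blob {
    my ($cell_id) = @_;
    my ($blob) = $dbh->selectrow_array(
        "SELECT cell_data FROM cell_blobs WHERE cell_id = ?",
        undef, $cell_id);
    return thaw($blob);
}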

--

when small people start casting long shadows, it is time to go to bed

In reply to speeding up row by row lookup in a large db by punkish
