In my quest for speed in data retrieval, I've been experimenting with both common and uncommon structures and routines, with surprising results. It's made for an interesting comparison of different approaches, with and without DBI. I would appreciate critiques/ideas.
Background
I have a datafile of multiple-choice questionnaire results, in the form:

```
respondent_id: response_x, response_y, ...
```

There are ~20,000 respondents and 3,000 possible responses, each with its own ID, although the average respondent chose only about 1/5 of the possible responses. So the raw data looks like this:

```
respondent_1: 123, 456, 789, 23, 1574, ...
respondent_2: 65, 1893, 2853, 753, ...
etc.
```

The data is compiled off-line. Once compiled, it never changes, so the speed of inserts, updates, deletions, etc. is of absolutely no concern.
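Parsing the raw file into a Perl structure is trivial - something along these lines (the file name and the %responses hash are just for illustration):

```perl
# Parse lines of the form "respondent_1: 123, 456, 789, ..."
# into respondent_id => [ list of response ids ].
my %responses;
open my $in, '<', 'responses.dat' or die "cannot open responses.dat: $!";
while (my $line = <$in>) {
    chomp $line;
    my ($respondent, $rest) = split /:\s*/, $line, 2;
    $responses{$respondent} = [ split /\s*,\s*/, $rest ];
}
close $in;
```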
What the data is used for is to make cross-referenced queries of the form:

```
count all respondents who gave response x and response y
```

There may be hundreds of such queries to produce a single report. All possible values of x and y are equally likely to be searched on, and there are millions of possible permutations and combinations, so there's no opportunity to pre-generate reports - they have to be generated on the fly. The speed of SELECTs, therefore, is of paramount importance. Hence these experiments.
The Experiments
I've tried various mysql schemas, populated them with the same test data, and then benchmarked standard queries. The test data is the full 20,000 respondents x average 600 responses per respondent, so it accurately reflects the real application.
| db version | structure |
| DB 0 | The obvious respondent-per-row, column-per-response_id format: CREATE TABLE theTable (respondent, response_1, response_2, ... response_3000), where all response columns are tinyint(1) filled with 1 or 0. Produces a 20,000 x 3,000 table. |
| DB 1 | A single two-column table (respondent, response) with a compound index, requiring a row for each response of each respondent = ~12 million rows. |
| DB 2 | DB 1 with a key added to each column. |
| DB 3 | Getting creative - trying to minimize the amount of data sifted per query - one table per response id, used as an inverted index. That is, each table is a list of the respondents who gave that response (see the sketch after this table). |
| DB 4 | Took DB 3 a step further. Wondering about all this database and DBI overhead, but still wanting to minimize the data that has to be loaded and manipulated per query, I fell back on really old basics: forget about mysql and just create one text file for each table in DB 3. The idea is to load two small text files into arrays and find their intersection (also sketched after this table). |
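To make the structures concrete, here's roughly how DB 1, the DB 3 per-response tables, and the DB 4 text files get built - a simplified sketch, not the production code (the connection details, the sample %responses data, and the t_<id> table and r<id> file naming are illustrative assumptions):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Illustrative connection - real DSN/credentials differ.
my $dbh = DBI->connect("DBI:mysql:database=survey", "user", "password",
                       { RaiseError => 1 });

# %responses: respondent_id => list of response ids, parsed from the raw datafile.
my %responses = (
    1 => [ 123, 456, 789, 23, 1574 ],
    2 => [ 65, 1893, 2853, 753 ],
);

# DB 1: a single two-column table with a compound index.
$dbh->do("CREATE TABLE theTable (
              respondent INT NOT NULL,
              response   INT NOT NULL,
              PRIMARY KEY (respondent, response)
          )");
my $ins = $dbh->prepare("INSERT INTO theTable (respondent, response) VALUES (?, ?)");
for my $resp (keys %responses) {
    $ins->execute($resp, $_) for @{ $responses{$resp} };
}

# DB 3: one inverted-index table per response id (t_1 .. t_3000),
# each listing the respondents who gave that response.
for my $id (1 .. 3000) {
    $dbh->do("CREATE TABLE t_$id (respondent INT NOT NULL PRIMARY KEY)");
}

# DB 4: the same inverted lists as flat text files, one respondent id per line.
my $db_dir = "/path/to/textdb";   # illustrative location
for my $resp (keys %responses) {
    for my $id (@{ $responses{$resp} }) {
        $dbh->do("INSERT INTO t_$id (respondent) VALUES (?)", undef, $resp);
        open my $fh, '>>', "$db_dir/r$id" or die "cannot open $db_dir/r$id: $!";
        print $fh "$resp\n";
        close $fh;
    }
}
```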
Benchmarking
I put all the mysql databases through two tests: a straight SQL method, where the count is produced entirely by the SELECT, and a Perl intersection method, where the respondent lists are fetched and intersected in Perl. The Selects were benchmarked as follows:

```perl
$sql = "appropriate select statement";
$sth = $dbh->prepare($sql) or die("Could not prepare!" . $dbh->errstr);

$start_2 = new Benchmark;
for $i (0 .. $iterations) {
    $sth->execute() or die("Could not execute 1!" . $dbh->errstr);
}
$count = $sth->fetchrow_array();
$end_2 = new Benchmark;
$diff_2 = timediff($end_2, $start_2);
```
| db version | query |
| DB 0 | SELECT COUNT(*) FROM theTable WHERE (r_x = 1 AND r_y = 1) |
| DB 1 | SELECT COUNT(*) FROM theTable A JOIN theTable B ON (A.respondent = B.respondent) WHERE A.response = ? AND B.response = ? |
| DB 2 | Same as DB 1 |
| DB 3 | SELECT COUNT(*) FROM t_x JOIN t_y ON (t_x.respondent = t_y.respondent) |
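For example, plugging the DB 1 query into the benchmark's "appropriate select statement" looks roughly like this ($x and $y being the two response ids being cross-referenced, with $dbh already connected - a sketch, not the exact benchmark code):

```perl
my $sth = $dbh->prepare(
    "SELECT COUNT(*) FROM theTable A
     JOIN theTable B ON (A.respondent = B.respondent)
     WHERE A.response = ? AND B.response = ?"
);
$sth->execute($x, $y) or die("Could not execute!" . $dbh->errstr);
my ($count) = $sth->fetchrow_array();   # respondents who gave both x and y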
| db version | Perl Intersection Method |
| DB 0 | |
| DB 1 | |
| DB 2 | |
| DB 3 | |
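As a rough illustration of the Perl intersection method - here for DB 3, where each response has its own table of respondents - the idea is to fetch the two respondent lists and intersect them with a hash (a sketch; the code actually benchmarked may differ in detail):

```perl
# Pull the respondent list for each of the two responses (DB 3 tables t_x and t_y).
my $list_x = $dbh->selectcol_arrayref("SELECT respondent FROM t_x");
my $list_y = $dbh->selectcol_arrayref("SELECT respondent FROM t_y");

# Intersect in Perl: mark everyone who gave response x,
# then keep only those who also gave response y.
my %gave_x = map { $_ => 1 } @$list_x;
my @both   = grep { $gave_x{$_} } @$list_y;
my $count  = scalar @both;
```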
And I put the simple text-file "database", DB 4, through a similar array-intersection benchmark routine:
```perl
$start_1 = new Benchmark;
for $i (0 .. $iterations) {
    $table = "<".$db_dir."/r".$response_id;
    open(FILE, $table) or dienice("cannot open file : $table $!");
    while ($data = <FILE>) {
        $union{$data} = undef;
    }
    close(FILE);

    $table = "<".$db_dir."/r6000";
    open(FILE, $table) or dienice("cannot open file : $table $!");
    while ($data = <FILE>) {
        $isect{$data} = undef if exists $union{$data};
    }
    close(FILE);

    @isect = keys %isect;
    $response_id++;
}
$end_1 = new Benchmark;
$diff = timediff($end_1, $start_1);
```
Note especially the "$response_id++;", which ensures that I'm retrieving a new file on each iteration, so I'm not just working from cache.
Here are the benchmark results. These are the average times, in seconds, for executing the select plus the join/intersection needed to produce the sought-after count, after repeating the benchmarks ~20 times over two days, with iterations varying from 50 to 100:
| db version | SQL Method | Perl Intersection Method |
| DB 0 | 0.25 | 1.0 |
| DB 1 | 0.13 | 0.14 |
| DB 2 | 0.25 | 0.25 |
| DB 3 | 0.08 | 0.12 |
| DB 4 | N/A | 0.03 |
So here's my conundrum. If I believe I've done everything right, then for this specific application a plain old text-file system is outperforming every version of mysql w/ DBI I can imagine, by a factor of 10 compared to the "standard" structures.
I don't want to abandon mysql & DBI, but for this specific application, where select speed is everything, it's tempting. I guess what I'm looking for from my fellow monks - those intrepid enough to have read down this far - is an idea of: am I totally out to lunch? Have I done something really stupid to arrive at these results? Or do these results make sense, and it's just accepted that we all knowingly give up speed every day in favour of the convenience of SQL and DBI?