In my quest for speed in data retrieval, I've been experimenting with both common and uncommon structures and routines, with surprising results. It's made for an interesting comparison of different approaches, with and without DBI. I would appreciate critiques/ideas.
Background
I have a datafile of multiple-choice questionnaire results, in the form:

```
respondent_id: response_x, response_y, ...
```

There are ~20,000 respondents and 3,000 possible responses, each with its own ID, although the average respondent chose only about 1/5 of the possible responses. So the raw data looks like this:

```
respondent_1: 123, 456, 789, 23, 1574, ...
respondent_2: 65, 1893, 2853, 753, ...
etc.
```

The data is compiled off-line. Once compiled, it never changes, so the speed of inserts, updates, deletions, etc. is of absolutely no concern.
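Parsing the raw file into a Perl structure is trivial - something along these lines (the file name and the %responses hash are just for illustration):

```perl
# Parse lines of the form "respondent_1: 123, 456, 789, ..."
# into respondent_id => [ list of response ids ].
my %responses;
open my $in, '<', 'responses.dat' or die "cannot open responses.dat: $!";
while (my $line = <$in>) {
    chomp $line;
    my ($respondent, $rest) = split /:\s*/, $line, 2;
    $responses{$respondent} = [ split /\s*,\s*/, $rest ];
}
close $in;
```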
What the data is used for is to make cross-referenced queries of the form:

```
count all respondents who gave response x and response y
```

There may be hundreds of such queries to produce a single report. All possible values of x and y are equally likely to be searched on, and there are millions of possible permutations and combinations, so there's no opportunity to pre-generate reports - they have to be generated on the fly. The speed of SELECTs, therefore, is of paramount importance. Hence these experiments.
The Experiments
I've tried various mysql schemas, populated them with the same test data, and then benchmarked standard queries. The test data is the full 20,000 respondents x average 600 responses per respondent, so it accurately reflects the real application.
| db version | structure |
| DB 0 | The obvious respondent-per-row, column-per-response_id format: CREATE TABLE theTable (respondent, response_1, response_2, ... response_3000), where all response columns are tinyint(1) filled with 1 or 0. Produces a 20,000 x 3,000 table. |
| DB 1 | A single two-column table (respondent, response) with a compound index, requiring a row for each response of each respondent = ~12 million rows. |
| DB 2 | DB 1 with a key added to each column. |
| DB 3 | Getting creative - trying to minimize the amount of data sifted per query - one table per response id, used as an inverted index. That is, each table is a list of the respondents who gave that response (see the sketch after this table). |
| DB 4 | Took DB 3 a step further. Wondering about all this database and DBI overhead, but still wanting to minimize the data that has to be loaded and manipulated per query, I fell back on really old basics: forget about mysql and just create one text file for each table in DB 3. The idea is to load two small text files into arrays and find their intersection (also sketched after this table). |
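To make the structures concrete, here's roughly how DB 1, the DB 3 per-response tables, and the DB 4 text files get built - a simplified sketch, not the production code (the connection details, the sample %responses data, and the t_<id> table and r<id> file naming are illustrative assumptions):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Illustrative connection - real DSN/credentials differ.
my $dbh = DBI->connect("DBI:mysql:database=survey", "user", "password",
                       { RaiseError => 1 });

# %responses: respondent_id => list of response ids, parsed from the raw datafile.
my %responses = (
    1 => [ 123, 456, 789, 23, 1574 ],
    2 => [ 65, 1893, 2853, 753 ],
);

# DB 1: a single two-column table with a compound index.
$dbh->do("CREATE TABLE theTable (
              respondent INT NOT NULL,
              response   INT NOT NULL,
              PRIMARY KEY (respondent, response)
          )");
my $ins = $dbh->prepare("INSERT INTO theTable (respondent, response) VALUES (?, ?)");
for my $resp (keys %responses) {
    $ins->execute($resp, $_) for @{ $responses{$resp} };
}

# DB 3: one inverted-index table per response id (t_1 .. t_3000),
# each listing the respondents who gave that response.
for my $id (1 .. 3000) {
    $dbh->do("CREATE TABLE t_$id (respondent INT NOT NULL PRIMARY KEY)");
}

# DB 4: the same inverted lists as flat text files, one respondent id per line.
my $db_dir = "/path/to/textdb";   # illustrative location
for my $resp (keys %responses) {
    for my $id (@{ $responses{$resp} }) {
        $dbh->do("INSERT INTO t_$id (respondent) VALUES (?)", undef, $resp);
        open my $fh, '>>', "$db_dir/r$id" or die "cannot open $db_dir/r$id: $!";
        print $fh "$resp\n";
        close $fh;
    }
}
```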
Benchmarking
I put all the mysql databases through two tests: a straight SQL method, where the count is produced entirely by the SELECT, and a Perl intersection method, where the respondent lists are fetched and intersected in Perl. The Selects were benchmarked as follows:

```perl
$sql = "appropriate select statement";
$sth = $dbh->prepare($sql) or die("Could not prepare!" . $dbh->errstr);

$start_2 = new Benchmark;
for $i (0 .. $iterations) {
    $sth->execute() or die("Could not execute 1!" . $dbh->errstr);
}
$count = $sth->fetchrow_array();
$end_2 = new Benchmark;
$diff_2 = timediff($end_2, $start_2);
```
| db version | query |
| DB 0 | SELECT COUNT(*) FROM theTable WHERE (r_x = 1 AND r_y = 1) |
| DB 1 | SELECT COUNT(*) FROM theTable A JOIN theTable B ON (A.respondent = B.respondent) WHERE A.response = ? AND B.response = ? |
| DB 2 | Same as DB 1 |
| DB 3 | SELECT COUNT(*) FROM t_x JOIN t_y ON (t_x.respondent = t_y.respondent) |
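For example, plugging the DB 1 query into the benchmark's "appropriate select statement" looks roughly like this ($x and $y being the two response ids being cross-referenced, with $dbh already connected - a sketch, not the exact benchmark code):

```perl
my $sth = $dbh->prepare(
    "SELECT COUNT(*) FROM theTable A
     JOIN theTable B ON (A.respondent = B.respondent)
     WHERE A.response = ? AND B.response = ?"
);
$sth->execute($x, $y) or die("Could not execute!" . $dbh->errstr);
my ($count) = $sth->fetchrow_array();   # respondents who gave both x and y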
| db version | Perl Intersection Method |
| DB 0 | |
| DB 1 | |
| DB 2 | |
| DB 3 | |
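As a rough illustration of the Perl intersection method - here for DB 3, where each response has its own table of respondents - the idea is to fetch the two respondent lists and intersect them with a hash (a sketch; the code actually benchmarked may differ in detail):

```perl
# Pull the respondent list for each of the two responses (DB 3 tables t_x and t_y).
my $list_x = $dbh->selectcol_arrayref("SELECT respondent FROM t_x");
my $list_y = $dbh->selectcol_arrayref("SELECT respondent FROM t_y");

# Intersect in Perl: mark everyone who gave response x,
# then keep only those who also gave response y.
my %gave_x = map { $_ => 1 } @$list_x;
my @both   = grep { $gave_x{$_} } @$list_y;
my $count  = scalar @both;
```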
And I put the simple text-file "database", DB 4, through a similar array-intersection benchmark routine:
```perl
$start_1 = new Benchmark;
for $i (0 .. $iterations) {
    $table = "<".$db_dir."/r".$response_id;
    open(FILE, $table) or dienice("cannot open file : $table $!");
    while ($data = <FILE>) {
        $union{$data} = undef;
    }
    close(FILE);

    $table = "<".$db_dir."/r6000";
    open(FILE, $table) or dienice("cannot open file : $table $!");
    while ($data = <FILE>) {
        $isect{$data} = undef if exists $union{$data};
    }
    close(FILE);

    @isect = keys %isect;
    $response_id++;
}
$end_1 = new Benchmark;
$diff = timediff($end_1, $start_1);
```
Note especially the "$response_id++;", which ensures that I'm retrieving a new file on each iteration, so I'm not just working from cache.
Here are the benchmark results. These are the average times, in seconds, for executing the select plus the join/intersection needed to produce the sought-after count, after repeating the benchmarks ~20 times over two days, with iterations varying from 50 to 100:
| db version | SQL Method | Perl Intersection Method |
| DB 0 | 0.25 | 1.0 |
| DB 1 | 0.13 | 0.14 |
| DB 2 | 0.25 | 0.25 |
| DB 3 | 0.08 | 0.12 |
| DB 4 | N/A | 0.03 |
So here's my conundrum. If I believe I've done everything right, then for this specific application a plain old text-file system is outperforming every version of mysql w/ DBI I can imagine, by a factor of 10 compared to the "standard" structures.
I don't want to abandon mysql & DBI, but for this specific application, where select speed is everything, it's tempting. I guess what I'm looking for from my fellow monks - those intrepid enough to have read down this far - is an idea of: am I totally out to lunch? Have I done something really stupid to arrive at these results? Or do these results make sense, and it's just accepted that we all knowingly give up speed every day in favour of the convenience of SQL and DBI?