in reply to Basic Perl array intersection faster than mysql query join.

Well..it has been an interesting session.

Thanks again all for the input.

To answer a question posed above:

Yes, as described in the original post, it is a two-column database. The Primary Key is compound, on the two columns, like so:

CREATE TABLE `theTable` ( `col_1` smallint(5) unsigned NOT NULL default '0', `col_2` smallint(5) unsigned NOT NULL default '0', PRIMARY KEY (`col_1`,`col_2`) ) TYPE=MyISAM
12-million rows

Anyway - liked the direct comparison routine above, Jenda's suggestions, and the suggested re-write of the sql, and so I tried them all, like so:

#method 1 - Perl intersection using @col arrays $sql = "SELECT col_2 FROM theTable WHERE col_1 = ?"; $sth = $dbh->prepare($sql) or die("Could not prepare!" . $dbh->err +str); $start_1 = new Benchmark; for $i (0 .. $iterations) { $sth->execute($value_1) or die("Could not execute 1!" . $dbh-> +errstr); while ($data = $sth->fetchrow_array()) { push(@col_1, $data); } $sth->execute($value_2) or die("Could not execute 2!" . $dbh-> +errstr); while ($data = $sth->fetchrow_array()) { push(@col_2, $data); } foreach $e (@col_1) { $union{$e} = 1 } foreach $e (@col_2) { if ( $union{$e} ) { $isect{$e} = 1 } } @isect = keys %isect; } $end_1 = new Benchmark; $diff = timediff($end_1, $start_1); #method 2 - Perl intersection using 'undef' $sql = "SELECT col_2 FROM theTable WHERE col_1 = ?"; $sth = $dbh->prepare($sql) or die("Could not prepare!" . $dbh->err +str); $start_2 = new Benchmark; for $i (0 .. $iterations) { $sth->execute($value_1) or die("Could not execute 1!" . $dbh-> +errstr); while ($data = $sth->fetchrow_array()) { $union{$data} = undef; } $sth->execute($value_2) or die("Could not execute 2!" . $dbh-> +errstr); while ($data = $sth->fetchrow_array()) { $isect{$data} = undef if exists $union{$data}; } @isect = keys %isect; } $end_2 = new Benchmark; $diff = timediff($end_2, $start_2); #method 3 - MYSQL $sql = "SELECT COUNT(*) FROM theTable A JOIN responses B ON (A.col +_2 = B.col_2) WHERE A.col_1 = ? AND B.col_1 = ?"; $sth = $dbh->prepare($sql) or die("Could not prepare!" . $dbh->err +str); $start_3 = new Benchmark; for $i (0 .. $iterations) { $sth->execute($value_1, $value_2) or die("Could not execute 1! +" . $dbh->errstr); } $end_3 = new Benchmark; $diff = timediff($end_3, $start_3);
Result? For 100 iterations:

Method 1 - my original: 64 sec
Method 2 - modified sql & 'undef': 13 sec
Method 3 - SQL join: 15 sec

Pretty stable over ~10 repeats. I'm ultra impressed by the difference between 1 & 2.

But I'm jealous of the post above showing the same ~14 seconds for tens times as many iterations of the same query. What the heck?

Replies are listed 'Best First'.
Re^2: Basic Perl array intersection faster than mysql query join.
by dragonchild (Archbishop) on Oct 17, 2004 at 14:57 UTC
    Rerun your benchmark, but swap the order of methods 1 and 2. I suspect you'll see a shocking difference.

    As a general note, when doing benchmarks, you should run the various options separately, not in the same script. That way, you keep them from affecting each other.

    Being right, does not endow the right to be rude; politeness costs nothing.
    Being unknowing, is not the same as being stupid.
    Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
    Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.