Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked
 
PerlMonks  

Basic Perl array intersection faster than mysql query join.

by punch_card_don (Curate)
on Oct 12, 2004 at 17:09 UTC ( [id://398606]=perlquestion: print w/replies, xml ) Need Help??

punch_card_don has asked for the wisdom of the Perl Monks concerning the following question:

Momusian Monks,

In trying to tune my DBI-driven database middleware, I tried what is certainly poor form and got a surprise.

The mysql database has just two columns, col_1 & col_2, both smallint, but over 12-million rows. Neither column can be guaranteed unique values, but combinations of them are definitely unique. So, I created a compound Primary Key.

I want to count how many rows with the same col_2 value have either x or y in col_1.

To benchmark the database performance independently of the Perl script, I telnetted in and did:

select count(*) from theTable A inner join theTable B where (A.col_1 = x and B.col_1 = y and A.col_2 = B.col_2);

which gave me an average response time of ~0.13 sec. (Many thanks to those in the other thread who encouraged me to try various index configurations)

Then, I wondered how much time the join was adding, so I wrote a perl script:

$sql = "SELECT col_2 FROM theTable WHERE col_1 = $value_1"; $sth = $dbh->prepare($sql) or die("Could not prepare!" . $dbh- +>errstr); $sth->execute() or die("Could not execute 1!" . $dbh->errstr); while ($data = $sth->fetchrow_array()) { push(@col_1, $data); } $sql = "SELECT col_2 FROM theTable WHERE col_1 = $value_2"; $sth = $dbh->prepare($sql) or die("Could not prepare!" . $dbh- +>errstr); $sth->execute() or die("Could not execute 2!" . $dbh->errstr); while ($data = $sth->fetchrow_array()) { push(@col_2, $data); } foreach $e (@col_2) { $union{$e} = 1 } foreach $e (@col_1) { if ( $union{$e} ) { $isect{$e} = 1 } } @isect = keys %isect;
Essentially doing the join by finding the intersection of the two query result arrays myself in Perl. This had an average response time of ~0.10 sec! I expect it will be evn better if I replace the repetitive prepares with placeholders.

Now, I expect to be savaged for moving database functinality to Perl code, but if it works better......

Replies are listed 'Best First'.
Re: Basic Perl array intersection faster than mysql query join.
by dragonchild (Archbishop) on Oct 12, 2004 at 17:36 UTC
    My group lead once rewrote a large complex query as a series of smaller queries and dealt with the resultsets in Perl. The operation as a whole went from 15 minutes to 10 seconds. Not only that, but the load on the database server went from 10 to 2, even though the number of queries went from 1 to 12.

    That said, I would stress your benchmark a little. A difference of 3 hundredths of a second is, frankly, well within the statistical noise of a server. I'd run the two versions at least 10,000 times, just to make sure you're not hitting anomalous OS scheduler choices, root processes waking up, and the like.

    Another impact might be your harddrive. Perl in enough RAM will generally beat a database on a single 7200rpm disk. However, move that to a mirrored 15,000 rpm pair of disks and you might have a different story.

    A further impact might be the cardinality of your indices. The higher the cardinality, the fewer items there are per index value. This means that the index is more efficient, in that it will handle each item better. Some databases might not use low-cardinality indices as efficiently in a self-joined query as they would a higher-cardinality index.

    Another item - your query should be rewritten.

    select count(*) from theTable A inner join theTable B where (A.col_1 = x and B.col_1 = y and A.col_2 = B.col_2); should become SELECT COUNT(*) FROM theTable A JOIN theTable B ON (A.col_2 = B.col_2) WHERE A.col_1 = x AND B.col_1 = y

    You have your join condition in your where clause. If you put it in your from clause, some optimizers might be able to take advantage of that.

    Being right, does not endow the right to be rude; politeness costs nothing.
    Being unknowing, is not the same as being stupid.
    Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
    Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

Re: Basic Perl array intersection faster than mysql query join.
by perrin (Chancellor) on Oct 12, 2004 at 17:52 UTC
    Be careful: you are heading for trouble. There are various reasons why joining in Perl will sometimes be faster than a database. One might be a poor choice of database, or an obsolete version. Another might be missing indexes or incorrect SQL. There are also many tuning options on most databases.

    The most obvious thing though is that in order to do this in Perl you have to retrieve all of the records that you might possibly be interested in so that you can examine them. In any database of non-trivial size, that would be a very slow operation. In short, your in-memory approach will lose to the database as soon as the amount of data to be examined reaches a point where fetching it all from the database is too slow.

      Also, speed isn't the only thing to consider... If you are ever going to alter the tables or multiple table joins, or try to add full text indexing, or try to do something more complicated, a database will help you. Using a database versus using a (insert language here) script to do what you need to do, is a tradeoff between speed, maintainability, and east of expandibility... If, though, you know that all you will be ever doing is joining those two tables, and you are ready to deal with the risk of doing the join in the code base, and you need the speed, go for it.


      ----
      Zak - the office
Re: Basic Perl array intersection faster than mysql query join.
by kvale (Monsignor) on Oct 12, 2004 at 17:19 UTC
    Memory access is much faster than disk access, so if you can keep your whole database in memory, calculation can potentially happen much faster. So if you don't need the subtle relational capabilities of SQL and it can all fit in RAM, by all means stick with hashes or arrays.

    -Mark

      I found the same thing with c and Oracle.

      Your memory/disc thing is a very good point, the other important point here is the access path, especially with complex joins and big tables. The acess path that the database pick might not be the most optimized one.

      One usually code for the best algorithm for a specific situation, but database has to follow some algorithm that is faster for most situations, but not the fastest for every situation.

      But my experience with Oracle, usually it is a good idea just let the database do whatever it thinks as right, unless the difference is too big. By doing this, you reduce the chance of bugs in your code. Some take give thing.

        The acess path that the database pick might not be the most optimized one.

        The EXPLAIN command can be useful when trying to figure out exactly what the database is doing, and how to optimise it.

Re: Basic Perl array intersection faster than mysql query join.
by Jenda (Abbot) on Oct 12, 2004 at 18:02 UTC

    Why do you use the @col_1 and @col_2 arrays? Also it's more efficient to store just undefs in the hashes and then use exists():

    $sql = "SELECT col_2 FROM theTable WHERE col_1 = $value_1"; $sth = $dbh->prepare($sql) or die("Could not prepare!" . $dbh->errstr) +; $sth->execute() or die("Could not execute 1!" . $dbh->errstr); while ($data = $sth->fetchrow_array()) { $union{$data} = undef; } $sql = "SELECT col_2 FROM theTable WHERE col_1 = $value_2"; $sth = $dbh->prepare($sql) or die("Could not prepare!" . $dbh->errstr) +; $sth->execute() or die("Could not execute 2!" . $dbh->errstr); while ($data = $sth->fetchrow_array()) { $isect{$data} = undef if exists $union{$data}; } @isect = keys %isect;

    Jenda
    We'd like to help you learn to help yourself
    Look around you, all you see are sympathetic eyes
    Stroll around the grounds until you feel at home
       -- P. Simon in Mrs. Robinson

Re: Basic Perl array intersection faster than mysql query join.
by punch_card_don (Curate) on Oct 12, 2004 at 17:36 UTC
    If that is a sound approach, then my entire project could be reduced to simple selects of the form
    select col_2 from theTable where col_1 = $some_value;
    and then finding the intersection of two or more arrays.

    In that case, I am on a mission to find the absolutely most efficient array-interesection-finder routine possible. Yes, I've read and tried solutions from perlfaq4, and variations found by a search here on PM. I wonder if for the very specific application of two lists, each between 4,000 and 10,000 elements long, there isn't some ultra-simple and rapid approach.

      Here's another thought - if your database will only ever be used to find intersections, why not pre-calculate all of them in some manner? Just because your database is normalized doesn't mean it's the right schema for your purpose.

      For example, I have selectively denormalized my reporting database, just to reduce the number of table joins. Since I control all writing to the database and the database is specialized to a given purpose, this is a good breaking of the rules.

      Being right, does not endow the right to be rude; politeness costs nothing.
      Being unknowing, is not the same as being stupid.
      Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
      Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

Re: Basic Perl array intersection faster than mysql query join.
by eric256 (Parson) on Oct 12, 2004 at 18:42 UTC

    I wasn't clear from reading this if you indexed both colums individualy or not. You should have an index on each colum. This should result in a giant decrease in look up times.


    ___________
    Eric Hodges

      A very good point, I agree with you.

      He said compound key, so must be with two columns. However when you look at his query, he used only one column, so most likely the database was doing a full table scan instead of index scan.

Re: Basic Perl array intersection faster than mysql query join.
by itub (Priest) on Oct 12, 2004 at 18:36 UTC
    I was writing my very own search engine (TM) and had the index in a mysql database. To process the queries I had to intersect lists with up to several million items. I found that, whatever I tried, mysql was too slow. My solution was to move the core of the intersection part of the engine to a custom C program with its own binary file for the index.

    This may not be pretty, but it was the only way I could find to get fast results and use a reasonable amount of memory (the overhead of Perl arrays becomes problematic when you have millions of items!).

    So I think that in some cases it is OK to move some operations out of the database if there is a clear benefit. I was getting a 100x speedup or better in some cases, but I don't think I would do it for just a 20% speedup. YMMV.

Re: Basic Perl array intersection faster than mysql query join.
by punch_card_don (Curate) on Oct 12, 2004 at 18:15 UTC
    Yes, the database will only ever be used for selects and to find intersections. There will never be any insertions, deletions, or anything else, except by administrators on a very exceptional basis should an error be found in the data.

    As for pre-calculation, yes, I plan to do what can be done. There are a couple hundred queries that will be repeated often, as their results form the base line for statistical comparison of other queries. But there are 9-million possible 'other queries', each with equal probability of being asked. It's these I think maybe can be reduced to simple selects.

    Perrin does, however, give me reason to pause. My fields are smallint, and no retrieval will be larger than 10,000 single-value records. Normally I'd be doing the intersection of a 10,000 element array with a ~4,000 element array of smallints. I have compound primary key as the sole index. So before I commit to heavily to the Perl solution, I'll try a few different database schemas and indices. Nice to nkow I have the Perl solution in my pocket, though.

    Off I go ...

Re: Basic Perl array intersection faster than mysql query join.
by SpanishInquisition (Pilgrim) on Oct 12, 2004 at 20:37 UTC
    SQL has overhead to prepare queries -- stored procedures will probably be faster. But *more importantly* SQL has an overhead to establish a connection. When your data scales to 9 million rows, I seriously doubt Perl will be faster. Of course, in that case, you need to invest in something a little more robust than MySQL... (or at least move to postgres).

    Do not confuse your test environment with your production environment when running benchmarks. Make sure both are equally complex and large. Also consider the network overhead, what kind of server you have running the DB, what kind of disks it has, etc. As your problem scales in complexity and data size, things can/will change.

    My point of reference is batch operations for AT&T databases ... for a simple run-on-my-linux-box-for-purpose-X application, yeah, that may be a moot point.

Re: Basic Perl array intersection faster than mysql query join.
by Beechbone (Friar) on Oct 13, 2004 at 09:18 UTC
    Just a small warning to all that want to copy this:

    Don't do this on a database that can be changed while you are working!

    A (real) database provides you some data integrity and consistency that won't work between multiple statements unless you manually start locking rows/tables.


    Search, Ask, Know
Re: Basic Perl array intersection faster than mysql query join.
by demerphq (Chancellor) on Oct 13, 2004 at 14:50 UTC

    I just did a similar experiment and got the opposite results. I think this type of thing will really depend on matters like load and number of records involved in the fetch. My experiment was done on Sybase with a 10 million record table with a very similar structure to yours. The benchmark code is as follows (part is omitted as it includes connection data :-)

    our $sel1=qq[ select count(*) ]; our $sel2=qq[ select count(A.wsf_id) ]; our $str=qq[ from wsf_price_stats A,wsf_price_stats B where A.wsf_id = 30472036 and B.wsf_id = 30472008 and A.zp_id = B.zp_id ]; use Benchmark qw(cmpthese); $|++; cmpthese 1000, { 'count(*)' => q[my $r=$dbh->selectrow_arrayref($sel1.$str) or + die "$str:".$dbh->errstr();], 'count(wsf_id)' => q[my $r=$dbh->selectrow_arrayref($sel2.$str) or + die "$str:".$dbh->errstr();], 'perl' => q[get_isect()], }; sub get_isect { my (@col_1,@col_2,%isect,%union); my $sql = "SELECT zp_id FROM wsf_price_stats WHERE wsf_id = 304720 +36"; my $sth = $dbh->prepare($sql) or die("Could not prepare!" . $dbh-> +errstr); $sth->execute() or die("Could not execute 1!" . $dbh->errstr); while (my $data = $sth->fetchrow_array()) { push(@col_1, $data); } $sql = "SELECT zp_id FROM wsf_price_stats WHERE wsf_id = 30472008" +; $sth = $dbh->prepare($sql) or die("Could not prepare!" . $dbh->err +str); $sth->execute() or die("Could not execute 2!" . $dbh->errstr); while (my $data = $sth->fetchrow_array()) { push(@col_2, $data); } foreach my $e (@col_2) { $union{$e} = 1 } foreach my $e (@col_1) { if ( $union{$e} ) { $isect{$e} = 1 } } [scalar keys %isect] } __END__ Benchmark: timing 1000 iterations of count(*), count(wsf_id), perl... count(*): 17 wallclock secs ( 0.53 usr + 0.20 sys = 0.73 CPU) @ 13 +60.54/s (n=1000) count(wsf_id): 16 wallclock secs ( 0.47 usr + 0.11 sys = 0.58 CPU) @ + 1730.10/s (n=1000) perl: 403 wallclock secs ( 6.33 usr + 0.22 sys = 6.55 CPU) @ 1 +52.77/s (n=1000) Rate perl count(*) count(wsf_id) perl 153/s -- -89% -91% count(*) 1361/s 791% -- -21% count(wsf_id) 1730/s 1033% 27% --

    The reason I include count(*) versus count(wsf_id) is because often doing the latter on an indexed field means that the DB can optimize the query better as it need not count the rows in the table but rather only the entries in the index. Anyway the results show that were I to use your technique id see a massive reduction in efficiency. I suggest you explore the typical case that you are dealing with. If you have been lucky in your tests and there isnt much data moving around etc then the results could be out of the average for the performance of your ap. Either way its likely this technique will neither scale nor be particularly portable. In the long run I suspect other avenues for optimizing your processes will prove to be more fruitful.


    ---
    demerphq

      First they ignore you, then they laugh at you, then they fight you, then you win.
      -- Gandhi

      Flux8


Re: Basic Perl array intersection faster than mysql query join.
by punch_card_don (Curate) on Oct 13, 2004 at 18:29 UTC
    Well..it has been an interesting session.

    Thanks again all for the input.

    To answer a question posed above:

    Yes, as described in the original post, it is a two-column database. The Primary Key is compound, on the two columns, like so:

    CREATE TABLE `theTable` ( `col_1` smallint(5) unsigned NOT NULL default '0', `col_2` smallint(5) unsigned NOT NULL default '0', PRIMARY KEY (`col_1`,`col_2`) ) TYPE=MyISAM
    12-million rows

    Anyway - liked the direct comparison routine above, Jenda's suggestions, and the suggested re-write of the sql, and so I tried them all, like so:

    #method 1 - Perl intersection using @col arrays $sql = "SELECT col_2 FROM theTable WHERE col_1 = ?"; $sth = $dbh->prepare($sql) or die("Could not prepare!" . $dbh->err +str); $start_1 = new Benchmark; for $i (0 .. $iterations) { $sth->execute($value_1) or die("Could not execute 1!" . $dbh-> +errstr); while ($data = $sth->fetchrow_array()) { push(@col_1, $data); } $sth->execute($value_2) or die("Could not execute 2!" . $dbh-> +errstr); while ($data = $sth->fetchrow_array()) { push(@col_2, $data); } foreach $e (@col_1) { $union{$e} = 1 } foreach $e (@col_2) { if ( $union{$e} ) { $isect{$e} = 1 } } @isect = keys %isect; } $end_1 = new Benchmark; $diff = timediff($end_1, $start_1); #method 2 - Perl intersection using 'undef' $sql = "SELECT col_2 FROM theTable WHERE col_1 = ?"; $sth = $dbh->prepare($sql) or die("Could not prepare!" . $dbh->err +str); $start_2 = new Benchmark; for $i (0 .. $iterations) { $sth->execute($value_1) or die("Could not execute 1!" . $dbh-> +errstr); while ($data = $sth->fetchrow_array()) { $union{$data} = undef; } $sth->execute($value_2) or die("Could not execute 2!" . $dbh-> +errstr); while ($data = $sth->fetchrow_array()) { $isect{$data} = undef if exists $union{$data}; } @isect = keys %isect; } $end_2 = new Benchmark; $diff = timediff($end_2, $start_2); #method 3 - MYSQL $sql = "SELECT COUNT(*) FROM theTable A JOIN responses B ON (A.col +_2 = B.col_2) WHERE A.col_1 = ? AND B.col_1 = ?"; $sth = $dbh->prepare($sql) or die("Could not prepare!" . $dbh->err +str); $start_3 = new Benchmark; for $i (0 .. $iterations) { $sth->execute($value_1, $value_2) or die("Could not execute 1! +" . $dbh->errstr); } $end_3 = new Benchmark; $diff = timediff($end_3, $start_3);
    Result? For 100 iterations:

    Method 1 - my original: 64 sec
    Method 2 - modified sql & 'undef': 13 sec
    Method 3 - SQL join: 15 sec

    Pretty stable over ~10 repeats. I'm ultra impressed by the difference between 1 & 2.

    But I'm jealous of the post above showing the same ~14 seconds for tens times as many iterations of the same query. What the heck?

      Rerun your benchmark, but swap the order of methods 1 and 2. I suspect you'll see a shocking difference.

      As a general note, when doing benchmarks, you should run the various options separately, not in the same script. That way, you keep them from affecting each other.

      Being right, does not endow the right to be rude; politeness costs nothing.
      Being unknowing, is not the same as being stupid.
      Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
      Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://398606]
Approved by blokhead
Front-paged by itub
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having a coffee break in the Monastery: (10)
As of 2024-04-18 08:58 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found