Basic Perl array intersection faster than mysql query join.

punch_card_don has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.

Re: Basic Perl array intersection faster than mysql query join.
by dragonchild (Archbishop) on Oct 12, 2004 at 17:36 UTC

That said, I would stress your benchmark a little. A difference of 3 hundredths of a second is, frankly, well within the statistical noise of a server. I'd run the two versions at least 10,000 times, just to make sure you're not hitting anomalous OS scheduler choices, root processes waking up, and the like.

Another impact might be your harddrive. Perl in enough RAM will generally beat a database on a single 7200rpm disk. However, move that to a mirrored 15,000 rpm pair of disks and you might have a different story.

A further impact might be the cardinality of your indices. The higher the cardinality, the fewer items there are per index value. This means that the index is more efficient, in that it will handle each item better. Some databases might not use low-cardinality indices as efficiently in a self-joined query as they would a higher-cardinality index.

Another item - your query should be rewritten.

select count(*)
  from theTable A
  inner join theTable B
 where (A.col_1 = x and B.col_1 = y and A.col_2 = B.col_2);

should become

SELECT COUNT(*)
  FROM theTable A JOIN theTable B ON (A.col_2 = B.col_2)
 WHERE A.col_1 = x
   AND B.col_1 = y
[download]

You have your join condition in your where clause. If you put it in your from clause, some optimizers might be able to take advantage of that.

Being right, does not endow the right to be rude; politeness costs nothing.
Being unknowing, is not the same as being stupid.
Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

[reply]
[d/l]

Re: Basic Perl array intersection faster than mysql query join.
by perrin (Chancellor) on Oct 12, 2004 at 17:52 UTC

The most obvious thing though is that in order to do this in Perl you have to retrieve all of the records that you might possibly be interested in so that you can examine them. In any database of non-trivial size, that would be a very slow operation. In short, your in-memory approach will lose to the database as soon as the amount of data to be examined reaches a point where fetching it all from the database is too slow.

[reply]

Re^2: Basic Perl array intersection faster than mysql query join.

by zakzebrowski (Curate) on Oct 18, 2004 at 13:56 UTC

know

----
Zak - the office

[reply]

Re: Basic Perl array intersection faster than mysql query join.
by kvale (Monsignor) on Oct 12, 2004 at 17:19 UTC

-Mark

[reply]

Re^2: Basic Perl array intersection faster than mysql query join.

by pg (Canon) on Oct 12, 2004 at 20:33 UTC

I found the same thing with c and Oracle.

Your memory/disc thing is a very good point, the other important point here is the access path, especially with complex joins and big tables. The acess path that the database pick might not be the most optimized one.

One usually code for the best algorithm for a specific situation, but database has to follow some algorithm that is faster for most situations, but not the fastest for every situation.

But my experience with Oracle, usually it is a good idea just let the database do whatever it thinks as right, unless the difference is too big. By doing this, you reduce the chance of bugs in your code. Some take give thing.

[reply]

Re^3: Basic Perl array intersection faster than mysql query join.

by DrHyde (Prior) on Oct 13, 2004 at 08:32 UTC

The acess path that the database pick might not be the most optimized one.

The EXPLAIN command can be useful when trying to figure out exactly what the database is doing, and how to optimise it.

[reply]

Re: Basic Perl array intersection faster than mysql query join.
by Jenda (Abbot) on Oct 12, 2004 at 18:02 UTC

Why do you use the @col_1 and @col_2 arrays? Also it's more efficient to store just undefs in the hashes and then use exists():

$sql = "SELECT col_2 FROM theTable WHERE col_1 = $value_1";
$sth = $dbh->prepare($sql) or die("Could not prepare!" . $dbh->errstr)
+;
$sth->execute() or die("Could not execute 1!" . $dbh->errstr);
while ($data = $sth->fetchrow_array()) {
    $union{$data} = undef;
}
$sql = "SELECT col_2 FROM theTable WHERE col_1 = $value_2";
$sth = $dbh->prepare($sql) or die("Could not prepare!" . $dbh->errstr)
+;
$sth->execute() or die("Could not execute 2!" . $dbh->errstr);
while ($data = $sth->fetchrow_array()) {
    $isect{$data} = undef if exists $union{$data};
}

@isect = keys %isect;
[download]

Jenda
We'd like to help you learn to help yourself
Look around you, all you see are sympathetic eyes
Stroll around the grounds until you feel at home
-- P. Simon in Mrs. Robinson

[reply]
[d/l]

Re: Basic Perl array intersection faster than mysql query join.
by punch_card_don (Curate) on Oct 12, 2004 at 17:36 UTC

select col_2 from theTable where col_1 = $some_value;
[download]

In that case, I am on a mission to find the absolutely most efficient array-interesection-finder routine possible. Yes, I've read and tried solutions from perlfaq4, and variations found by a search here on PM. I wonder if for the very specific application of two lists, each between 4,000 and 10,000 elements long, there isn't some ultra-simple and rapid approach.

[reply]
[d/l]

Re^2: Basic Perl array intersection faster than mysql query join.

by dragonchild (Archbishop) on Oct 12, 2004 at 17:40 UTC

For example, I have selectively denormalized my reporting database, just to reduce the number of table joins. Since I control all writing to the database and the database is specialized to a given purpose, this is a good breaking of the rules.

[reply]

Re: Basic Perl array intersection faster than mysql query join.
by eric256 (Parson) on Oct 12, 2004 at 18:42 UTC

I wasn't clear from reading this if you indexed both colums individualy or not. You should have an index on each colum. This should result in a giant decrease in look up times.

___________
Eric Hodges

[reply]

Re^2: Basic Perl array intersection faster than mysql query join.

by pg (Canon) on Oct 12, 2004 at 20:46 UTC

A very good point, I agree with you.

He said compound key, so must be with two columns. However when you look at his query, he used only one column, so most likely the database was doing a full table scan instead of index scan.

[reply]

Re: Basic Perl array intersection faster than mysql query join.
by itub (Priest) on Oct 12, 2004 at 18:36 UTC

This may not be pretty, but it was the only way I could find to get fast results and use a reasonable amount of memory (the overhead of Perl arrays becomes problematic when you have millions of items!).

So I think that in some cases it is OK to move some operations out of the database if there is a clear benefit. I was getting a 100x speedup or better in some cases, but I don't think I would do it for just a 20% speedup. YMMV.

[reply]

Re: Basic Perl array intersection faster than mysql query join.
by punch_card_don (Curate) on Oct 12, 2004 at 18:15 UTC

As for pre-calculation, yes, I plan to do what can be done. There are a couple hundred queries that will be repeated often, as their results form the base line for statistical comparison of other queries. But there are 9-million possible 'other queries', each with equal probability of being asked. It's these I think maybe can be reduced to simple selects.

Perrin does, however, give me reason to pause. My fields are smallint, and no retrieval will be larger than 10,000 single-value records. Normally I'd be doing the intersection of a 10,000 element array with a ~4,000 element array of smallints. I have compound primary key as the sole index. So before I commit to heavily to the Perl solution, I'll try a few different database schemas and indices. Nice to nkow I have the Perl solution in my pocket, though.

Off I go ...

[reply]

Re: Basic Perl array intersection faster than mysql query join.
by SpanishInquisition (Pilgrim) on Oct 12, 2004 at 20:37 UTC

Do not confuse your test environment with your production environment when running benchmarks. Make sure both are equally complex and large. Also consider the network overhead, what kind of server you have running the DB, what kind of disks it has, etc. As your problem scales in complexity and data size, things can/will change.

My point of reference is batch operations for AT&T databases ... for a simple run-on-my-linux-box-for-purpose-X application, yeah, that may be a moot point.

[reply]

Re: Basic Perl array intersection faster than mysql query join.
by Beechbone (Friar) on Oct 13, 2004 at 09:18 UTC

Don't do this on a database that can be changed while you are working!

A (real) database provides you some data integrity and consistency that won't work between multiple statements unless you manually start locking rows/tables.

_Search, Ask, ^Know

[reply]

Re: Basic Perl array intersection faster than mysql query join.
by demerphq (Chancellor) on Oct 13, 2004 at 14:50 UTC

I just did a similar experiment and got the opposite results. I think this type of thing will really depend on matters like load and number of records involved in the fetch. My experiment was done on Sybase with a 10 million record table with a very similar structure to yours. The benchmark code is as follows (part is omitted as it includes connection data :-)


our $sel1=qq[
    select count(*)
];
our $sel2=qq[
    select count(A.wsf_id)
];
our $str=qq[
    from wsf_price_stats A,wsf_price_stats B
    where A.wsf_id = 30472036
    and B.wsf_id   = 30472008
    and A.zp_id    = B.zp_id
];

use Benchmark qw(cmpthese);
$|++;
cmpthese 1000, {
    'count(*)'      => q[my $r=$dbh->selectrow_arrayref($sel1.$str) or
+ die "$str:".$dbh->errstr();],
    'count(wsf_id)' => q[my $r=$dbh->selectrow_arrayref($sel2.$str) or
+ die "$str:".$dbh->errstr();],
    'perl'          => q[get_isect()],
};

sub get_isect {
    my (@col_1,@col_2,%isect,%union);

    my $sql = "SELECT zp_id FROM wsf_price_stats WHERE wsf_id = 304720
+36";
    my $sth = $dbh->prepare($sql) or die("Could not prepare!" . $dbh->
+errstr);
    $sth->execute() or die("Could not execute 1!" . $dbh->errstr);
    while (my $data = $sth->fetchrow_array()) {
        push(@col_1, $data);
    }
    $sql = "SELECT zp_id FROM wsf_price_stats WHERE wsf_id = 30472008"
+;
    $sth = $dbh->prepare($sql) or die("Could not prepare!" . $dbh->err
+str);
    $sth->execute() or die("Could not execute 2!" . $dbh->errstr);
    while (my $data = $sth->fetchrow_array()) {
        push(@col_2, $data);
    }

    foreach my $e (@col_2) { $union{$e} = 1 }
    foreach my $e (@col_1) {
        if ( $union{$e} ) { $isect{$e} = 1 }
    }
    [scalar keys %isect]
}


__END__

Benchmark: timing 1000 iterations of count(*), count(wsf_id), perl...
  count(*): 17 wallclock secs ( 0.53 usr +  0.20 sys =  0.73 CPU) @ 13
+60.54/s (n=1000)
count(wsf_id): 16 wallclock secs ( 0.47 usr +  0.11 sys =  0.58 CPU) @
+ 1730.10/s (n=1000)
      perl: 403 wallclock secs ( 6.33 usr +  0.22 sys =  6.55 CPU) @ 1
+52.77/s (n=1000)
                Rate          perl      count(*) count(wsf_id)
perl           153/s            --          -89%          -91%
count(*)      1361/s          791%            --          -21%
count(wsf_id) 1730/s         1033%           27%            --
[download]

The reason I include count(*) versus count(wsf_id) is because often doing the latter on an indexed field means that the DB can optimize the query better as it need not count the rows in the table but rather only the entries in the index. Anyway the results show that were I to use your technique id see a massive reduction in efficiency. I suggest you explore the typical case that you are dealing with. If you have been lucky in your tests and there isnt much data moving around etc then the results could be out of the average for the performance of your ap. Either way its likely this technique will neither scale nor be particularly portable. In the long run I suspect other avenues for optimizing your processes will prove to be more fruitful.

---
demerphq

_{First they ignore you, then they laugh at you, then they fight you, then you win.

-- Gandhi

Flux8}

[reply]
[d/l]

Re: Basic Perl array intersection faster than mysql query join.
by punch_card_don (Curate) on Oct 13, 2004 at 18:29 UTC

Thanks again all for the input.

To answer a question posed above:

Yes, as described in the original post, it is a two-column database. The Primary Key is compound, on the two columns, like so:

CREATE TABLE `theTable` (
  `col_1` smallint(5) unsigned NOT NULL default '0',
  `col_2` smallint(5) unsigned NOT NULL default '0',
  PRIMARY KEY  (`col_1`,`col_2`)
) TYPE=MyISAM
[download]

Anyway - liked the direct comparison routine above, Jenda's suggestions, and the suggested re-write of the sql, and so I tried them all, like so:

    #method 1 - Perl intersection using @col arrays
    $sql = "SELECT col_2 FROM theTable WHERE col_1 = ?";
    $sth = $dbh->prepare($sql) or die("Could not prepare!" . $dbh->err
+str);
    $start_1 = new Benchmark;
    for $i (0 .. $iterations) {
        $sth->execute($value_1) or die("Could not execute 1!" . $dbh->
+errstr);
        while ($data = $sth->fetchrow_array()) {
            push(@col_1, $data);
        }
        $sth->execute($value_2) or die("Could not execute 2!" . $dbh->
+errstr);
        while ($data = $sth->fetchrow_array()) {
            push(@col_2, $data);
        }
        foreach $e (@col_1) { $union{$e} = 1 }
        foreach $e (@col_2) {
            if ( $union{$e} ) { $isect{$e} = 1 }
        }
        @isect = keys %isect;
    }
    $end_1 = new Benchmark;
    $diff = timediff($end_1, $start_1);

    #method 2 - Perl intersection using 'undef'
    $sql = "SELECT col_2 FROM theTable WHERE col_1 = ?";
    $sth = $dbh->prepare($sql) or die("Could not prepare!" . $dbh->err
+str);
    $start_2 = new Benchmark;
    for $i (0 .. $iterations) {
        $sth->execute($value_1) or die("Could not execute 1!" . $dbh->
+errstr);
        while ($data = $sth->fetchrow_array()) {
            $union{$data} = undef;
        }
        $sth->execute($value_2) or die("Could not execute 2!" . $dbh->
+errstr);
        while ($data = $sth->fetchrow_array()) {
            $isect{$data} = undef if exists $union{$data};
        }
        @isect = keys %isect;
    }
    $end_2 = new Benchmark;
    $diff = timediff($end_2, $start_2);

    #method 3 - MYSQL
    $sql = "SELECT COUNT(*) FROM theTable A JOIN responses B ON (A.col
+_2 = B.col_2) WHERE A.col_1 = ? AND B.col_1 = ?";
    $sth = $dbh->prepare($sql) or die("Could not prepare!" . $dbh->err
+str);
    $start_3 = new Benchmark;
    for $i (0 .. $iterations) {
        $sth->execute($value_1, $value_2) or die("Could not execute 1!
+" . $dbh->errstr);
    }
    $end_3 = new Benchmark;
    $diff = timediff($end_3, $start_3);
[download]

Method 1 - my original: 64 sec
Method 2 - modified sql & 'undef': 13 sec
Method 3 - SQL join: 15 sec

Pretty stable over ~10 repeats. I'm ultra impressed by the difference between 1 & 2.

But I'm jealous of the post above showing the same ~14 seconds for tens times as many iterations of the same query. What the heck?

[reply]
[d/l]
[select]

Re^2: Basic Perl array intersection faster than mysql query join.

by dragonchild (Archbishop) on Oct 17, 2004 at 14:57 UTC

As a general note, when doing benchmarks, you should run the various options separately, not in the same script. That way, you keep them from affecting each other.

[reply]