in reply to Basic Perl array intersection faster than mysql query join.

I just did a similar experiment and got the opposite results. I think this type of thing will really depend on matters like load and number of records involved in the fetch. My experiment was done on Sybase with a 10 million record table with a very similar structure to yours. The benchmark code is as follows (part is omitted as it includes connection data :-)

our $sel1=qq[ select count(*) ]; our $sel2=qq[ select count(A.wsf_id) ]; our $str=qq[ from wsf_price_stats A,wsf_price_stats B where A.wsf_id = 30472036 and B.wsf_id = 30472008 and A.zp_id = B.zp_id ]; use Benchmark qw(cmpthese); $|++; cmpthese 1000, { 'count(*)' => q[my $r=$dbh->selectrow_arrayref($sel1.$str) or + die "$str:".$dbh->errstr();], 'count(wsf_id)' => q[my $r=$dbh->selectrow_arrayref($sel2.$str) or + die "$str:".$dbh->errstr();], 'perl' => q[get_isect()], }; sub get_isect { my (@col_1,@col_2,%isect,%union); my $sql = "SELECT zp_id FROM wsf_price_stats WHERE wsf_id = 304720 +36"; my $sth = $dbh->prepare($sql) or die("Could not prepare!" . $dbh-> +errstr); $sth->execute() or die("Could not execute 1!" . $dbh->errstr); while (my $data = $sth->fetchrow_array()) { push(@col_1, $data); } $sql = "SELECT zp_id FROM wsf_price_stats WHERE wsf_id = 30472008" +; $sth = $dbh->prepare($sql) or die("Could not prepare!" . $dbh->err +str); $sth->execute() or die("Could not execute 2!" . $dbh->errstr); while (my $data = $sth->fetchrow_array()) { push(@col_2, $data); } foreach my $e (@col_2) { $union{$e} = 1 } foreach my $e (@col_1) { if ( $union{$e} ) { $isect{$e} = 1 } } [scalar keys %isect] } __END__ Benchmark: timing 1000 iterations of count(*), count(wsf_id), perl... count(*): 17 wallclock secs ( 0.53 usr + 0.20 sys = 0.73 CPU) @ 13 +60.54/s (n=1000) count(wsf_id): 16 wallclock secs ( 0.47 usr + 0.11 sys = 0.58 CPU) @ + 1730.10/s (n=1000) perl: 403 wallclock secs ( 6.33 usr + 0.22 sys = 6.55 CPU) @ 1 +52.77/s (n=1000) Rate perl count(*) count(wsf_id) perl 153/s -- -89% -91% count(*) 1361/s 791% -- -21% count(wsf_id) 1730/s 1033% 27% --

The reason I include count(*) versus count(wsf_id) is because often doing the latter on an indexed field means that the DB can optimize the query better as it need not count the rows in the table but rather only the entries in the index. Anyway the results show that were I to use your technique id see a massive reduction in efficiency. I suggest you explore the typical case that you are dealing with. If you have been lucky in your tests and there isnt much data moving around etc then the results could be out of the average for the performance of your ap. Either way its likely this technique will neither scale nor be particularly portable. In the long run I suspect other avenues for optimizing your processes will prove to be more fruitful.


---
demerphq

    First they ignore you, then they laugh at you, then they fight you, then you win.
    -- Gandhi

    Flux8