Re^5: DBI::SQLite slowness

That's actually the same behaviour that other DBs have.

But only now do I see the initial thread/problem (Scaling Hash Limits). (It's useful to link to the original thread in follow-up posts, you know.) With the relatively small sizes involved, a database doesn't seem necessary.

If the problem is that simple, can't you just run

sort -u dupslist > no_dupslist

on your id list? Perhaps not very interesting, or fast (took about 7 minutes in a 100M test run here), but about as simple as it gets.
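
For what it's worth, here is a minimal sketch of the same thing on a toy file. The sample ids and the temp directory are made up for illustration, and LC_ALL=C is my addition: it usually speeds sort up by comparing raw bytes instead of applying locale collation rules.

```shell
# Hypothetical sample data: five ids, two of them duplicated.
workdir=$(mktemp -d)
printf '%s\n' 000000000003 000000000001 000000000003 000000000002 000000000001 > "$workdir/dupslist"

# Sort and keep one copy of each line. LC_ALL=C (an addition, not part of the
# original command) makes sort compare raw bytes instead of locale rules.
LC_ALL=C sort -u "$workdir/dupslist" > "$workdir/no_dupslist"

# Compare line counts before and after de-duplication.
before=$(wc -l < "$workdir/dupslist")
after=$(wc -l < "$workdir/no_dupslist")
echo "before: $before lines, after: $after lines"
```

Note that the output comes back sorted, so this only fits if the original order of the ids doesn't matter.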

(BTW, just another datapoint (as I did the test already): PostgreSQL (9.4devel) loads about 9000 rows/s on a slowish, low-end desktop. That's with the laborious INSERT method that your script uses; bulk-loading (with COPY) loads ~1 million rows/second (excluding any de-duplication):

perl -e 'for (1..50_000_000){
   printf "%012d\n", $_;
}' > t_data.txt;

echo "
  drop table if exists t;
  create unlogged table t(klout integer);
" | psql;

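# Server-side load; assumes t_data.txt was written in /tmp and is readable by the server:
echo "copy t from '/tmp/t_data.txt'; " | psql
# Client-side load via stdin (the timed run); note that running both commands loads the data twice:
time < t_data.txt psql -c 'copy t from stdin'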

real    0m25.661s

That's a rate of just under 2 million rows per second.

)
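
The arithmetic behind that rate, as a quick sketch: it just divides the 50 million generated rows by the reported wall-clock time. The variable names are mine, and awk is used only because plain shell arithmetic has no floating point.

```shell
rows=50000000      # rows written by the perl one-liner above
seconds=25.661     # the "real" time reported for the COPY

# Integer rows-per-second rate (the fractional part is truncated).
rate=$(awk -v r="$rows" -v s="$seconds" 'BEGIN { printf "%d", r / s }')
echo "$rate rows/s"
```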

UPDATE: added 'unlogged' and adjusted the timings (the unlogged table makes the load twice as fast).
