in reply to DBI::SQLite slowness
Assumption: this data you are de-duping is downloaded fresh, daily from TwitFace.
The idea of loading 180 million records into a db on disk in order to de-dup it is ridiculous if you are in any way concerned with speed.
The following shows a 10-line Perl script de-duping a 200-million line, 2.8 GB file of 12-digit numbers in a little over 3 1/2 minutes (roughly 213 seconds by the timestamps below), using less than 30 MB of RAM to do so:
```
C:\test>dir 1054929.dat
20/09/2013  04:22     2,800,000,000 1054929.dat

C:\test>wc -l 1054929.dat
200000000 1054929.dat

C:\test>head 1054929.dat
100112321443
100135127486
100110839892
100098464584
100098900542
100048844759
100090430059
100018238859
100132791659
100027638968

C:\test>1054929 1054929.dat | wc -l
1379647642.87527
1379647855.6311
113526721
```
That's processing the 12-digit numbers at a rate of just under 1 million per second.
You cannot even load the data into the DB at 1/100th of that rate, never mind get the de-duped data back out.
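The script itself is not reproduced above, so here is a minimal sketch of one way to hit that kind of memory footprint, assuming the values cluster just above 100,000,000,000 as the head output suggests; the $BASE constant, the stderr timestamps and the variable names are illustrative, not the actual code:

```perl
#!/usr/bin/perl
# Illustrative sketch only -- not the script from the run above.
# Assumes (as the sample data suggests) that every value sits in a
# narrow band just above 100,000,000,000, so a vec() bit vector of
# the offsets stays in the tens of megabytes.
use strict;
use warnings;
use Time::HiRes qw( time );

my $BASE = 100_000_000_000;   # assumed lower bound of the values
my $seen = '';                # bit vector; vec() grows it on demand

print STDERR time(), "\n";    # start timestamp (stderr, so it is not
                              # swallowed by a downstream "| wc -l")
while( my $line = <> ) {
    chomp $line;
    my $off = $line - $BASE;          # map the value into the vector
    next if vec( $seen, $off, 1 );    # seen before: drop the duplicate
    vec( $seen, $off, 1 ) = 1;        # first sighting: mark it...
    print $line, "\n";                # ...and pass it through
}
print STDERR time(), "\n";    # end timestamp
```

If the values really do span only ~135 million distinct offsets, as the head output hints, the bit vector tops out around 17 MB, which is how a 200-million line file can be de-duped without touching disk and in well under 30 MB of RAM.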